Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure
Abstract
Tokenization is a critical design choice in genomic language modeling. Widely used schemes---character-level encoding, fixed-length $k$-mers, and greedy subword algorithms such as BPE---show intrinsic limitations on DNA that are magnified by the small four-letter alphabet. To address this, we adapt Ladderpath, an Algorithmic Information Theory method that identifies nested and hierarchical repetitions through optimal information reuse, into a tokenizer tailored for genomic sequences. Integrating this tokenizer into an 86-million-parameter Transformer yields the Ladderpath Tokenized Model (LTM), which surpasses the best existing models---including those several times larger---on 17 of 21 benchmarks. Comparisons with TF-IDF and other frequency-based baselines show that these gains extend beyond simple motif-frequency statistics. LTM's internal representations further exhibit biologically meaningful organization: token embeddings form coherent clusters, and sequence embeddings group promoters, enhancers, and histone-mark-associated regions without task-specific supervision, revealing an emergent structure of functional sequence classes. These findings show that strengthening the information-theoretic basis of tokenization provides a complementary path to architectural innovations and model scaling, enabling more compact and biologically aligned genomic foundation models.
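To make the contrast among the baseline schemes named above concrete, the sketch below tokenizes a short DNA string at character level, with fixed-length non-overlapping $k$-mers, and with a greedy BPE-style pair-merging loop. This is a minimal illustrative toy, not the paper's implementation; the example sequence, function names, and merge count are our own assumptions, and Ladderpath itself is not reproduced here.

```python
# Illustrative comparison of baseline DNA tokenization schemes
# (toy sketch; not the paper's implementation).
from collections import Counter

seq = "ATGCGATGCGATTACA"  # hypothetical example sequence

# 1) Character-level: every base is a token (vocabulary of 4).
char_tokens = list(seq)

# 2) Fixed-length k-mers: non-overlapping windows of size k
#    (a remainder shorter than k is dropped in this toy).
k = 3
kmer_tokens = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

# 3) Greedy BPE-style merging: repeatedly fuse the most frequent
#    adjacent token pair, mirroring byte-pair encoding on the
#    four-letter alphabet.
def bpe_tokens(s, num_merges=4):
    tokens = list(s)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the winning pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(char_tokens)        # ['A', 'T', 'G', ...]
print(kmer_tokens)        # ['ATG', 'CGA', 'TGC', 'GAT', 'TAC']
print(bpe_tokens(seq))    # merged subword tokens
```

The greedy merge criterion is purely frequency-driven, which is precisely the limitation the abstract attributes to BPE on a four-letter alphabet; Ladderpath instead selects tokens by optimal information reuse across nested repetitions.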
Declaration of Competing Interests
The authors declare no competing interests to disclose.
Copyright
The copyright holder for this preprint is the author/funder.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.