Preprint / Version 1

Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure

This article is a preprint and has not been certified by peer review.

Authors

Yiping Wang, Jing Wang, Junhao Zhu, Fengyao Zhai, Hu Zhu, Ziwei Dai, Zengru Di, Da Zhou, Yu Liu
Keywords
Tokenization; Algorithmic Information Theory; Language Model

Abstract

Tokenization is a critical design choice in genomic language modeling. Widely used schemes---character-level encoding, fixed-length $k$-mers, and greedy subword algorithms such as BPE---show intrinsic limitations on DNA that are magnified by the small four-letter alphabet. To address this, we adapt Ladderpath, an Algorithmic Information Theory method that identifies nested and hierarchical repetitions through optimal information reuse, into a tokenizer tailored for genomic sequences. Integrating this tokenizer into an 86-million-parameter Transformer yields the Ladderpath Tokenized Model (LTM), which surpasses the best existing models---including those several times larger---on 17 of 21 benchmarks. Comparisons with TF-IDF and other frequency-based baselines show that these gains extend beyond simple motif-frequency statistics. LTM's internal representations further exhibit biologically meaningful organization: token embeddings form coherent clusters, and sequence embeddings group promoters, enhancers, and histone-mark-associated regions without task-specific supervision, revealing an emergent structure of functional sequence classes. These findings show that strengthening the information-theoretic basis of tokenization provides a complementary path to architectural innovations and model scaling, enabling more compact and biologically aligned genomic foundation models.
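
For readers unfamiliar with the baseline schemes named in the abstract, the following is a minimal Python sketch of character-level encoding, fixed-length k-mers, and a single greedy BPE merge step on a DNA string. It is not the authors' code and does not implement Ladderpath. The function names (`char_tokens`, `kmer_tokens`, `bpe_merge_once`) and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_tokens(seq):
    """Character-level encoding: one token per nucleotide."""
    return list(seq)


def kmer_tokens(seq, k=3, stride=None):
    """Fixed-length k-mers.

    stride=k (the default here) gives non-overlapping tokens and drops any
    trailing remainder shorter than k; stride=1 gives overlapping tokens.
    """
    stride = k if stride is None else stride
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def bpe_merge_once(tokens):
    """One greedy BPE step: merge the most frequent adjacent pair everywhere."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens)
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        # Merge left to right, consuming both tokens of a matching pair.
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(char_tokens("ACGT"))
    print(kmer_tokens("ACGTAC", k=3))
    print(bpe_merge_once(list("ACACGT")))
```

With only four symbols, character-level tokens carry little information each, fixed k-mers impose boundaries that ignore repeat structure, and BPE picks merges by local pair frequency alone. These are the kinds of limitations the abstract refers to.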

Posted

2025-12-11

How to Cite

Wang, Y., Wang, J., Zhu, J., Zhai, F., Zhu, H., Dai, Z., Di, Z., Zhou, D., & Liu, Y. (2025). Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure. LangTaoSha Preprint Server. https://doi.org/10.65215/2qt5jb81

Declaration of Competing Interests

The authors declare no competing interests.