Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure
Abstract
Tokenization is a critical design choice in genomic language modeling. Widely used schemes---character-level encoding, fixed-length $k$-mers, and greedy subword algorithms such as BPE---show intrinsic limitations on DNA that are magnified by the small four-letter alphabet. To address this, we adapt Ladderpath, an Algorithmic Information Theory method that identifies nested and hierarchical repetitions through optimal information reuse, into a tokenizer tailored for genomic sequences. Integrating this tokenizer into an 86-million-parameter Transformer yields the Ladderpath Tokenized Model (LTM), which surpasses the best existing models---including those several times larger---on 17 of 21 benchmarks. Comparisons with TF-IDF and other frequency-based baselines show that these gains extend beyond simple motif-frequency statistics. LTM's internal representations further exhibit biologically meaningful organization: token embeddings form coherent clusters, and sequence embeddings group promoters, enhancers, and histone-mark-associated regions without task-specific supervision, revealing an emergent structure of functional sequence classes. These findings show that strengthening the information-theoretic basis of tokenization provides a complementary path to architectural innovations and model scaling, enabling more compact and biologically aligned genomic foundation models.
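To make the contrast among the baseline schemes named above concrete, the sketch below tokenizes a short DNA string at character level, with fixed-length non-overlapping $k$-mers, and with a greedy BPE-style pair-merging loop. This is a minimal illustrative toy, not the paper's implementation; the example sequence, function names, and merge count are our own assumptions, and Ladderpath itself is not reproduced here.

```python
# Illustrative comparison of baseline DNA tokenization schemes
# (toy sketch; not the paper's implementation).
from collections import Counter

seq = "ATGCGATGCGATTACA"  # hypothetical example sequence

# 1) Character-level: every base is a token (vocabulary of 4).
char_tokens = list(seq)

# 2) Fixed-length k-mers: non-overlapping windows of size k
#    (a remainder shorter than k is dropped in this toy).
k = 3
kmer_tokens = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

# 3) Greedy BPE-style merging: repeatedly fuse the most frequent
#    adjacent token pair, mirroring byte-pair encoding on the
#    four-letter alphabet.
def bpe_tokens(s, num_merges=4):
    tokens = list(s)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the winning pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(char_tokens)        # ['A', 'T', 'G', ...]
print(kmer_tokens)        # ['ATG', 'CGA', 'TGC', 'GAT', 'TAC']
print(bpe_tokens(seq))    # merged subword tokens
```

The greedy merge criterion is purely frequency-driven, which is precisely the limitation the abstract attributes to BPE on a four-letter alphabet; Ladderpath instead selects tokens by optimal information reuse across nested repetitions.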
Declaration of Competing Interests
The authors declare no competing interests to disclose.
Copyright
The copyright holder for this preprint is the author/funder.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.