Preprint / Version 1

Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure

This article is a preprint and has not been certified by peer review.

Authors

    Yiping Wang, Jing Wang, Junhao Zhu, Fengyao Zhai, Hu Zhu, Ziwei Dai, Zengru Di, Da Zhou, Yu Liu
Keywords
Tokenization; Algorithmic Information Theory; Language Model

Abstract

Tokenization is a critical design choice in genomic language modeling. Widely used schemes (character-level encoding, fixed-length $k$-mers, and greedy subword algorithms such as BPE) show intrinsic limitations on DNA that are magnified by the small four-letter alphabet. To address this, we adapt Ladderpath, an Algorithmic Information Theory method that identifies nested and hierarchical repetitions through optimal information reuse, into a tokenizer tailored for genomic sequences. Integrating this tokenizer into an 86-million-parameter Transformer yields the Ladderpath Tokenized Model (LTM), which surpasses the best existing models, including those several times larger, on 17 of 21 benchmarks. Comparisons with TF-IDF and other frequency-based baselines show that these gains extend beyond simple motif-frequency statistics. LTM's internal representations further exhibit biologically meaningful organization: token embeddings form coherent clusters, and sequence embeddings group promoters, enhancers, and histone-mark-associated regions without task-specific supervision, revealing an emergent structure of functional sequence classes. These findings show that strengthening the information-theoretic basis of tokenization provides a complementary path to architectural innovations and model scaling, enabling more compact and biologically aligned genomic foundation models.

DOI: https://doi.org/10.65215/2qt5jb81

Submission ID: 51

Published

2025-12-11

How to Cite

Wang, Y., Wang, J., Zhu, J., Zhai, F., Zhu, H., Dai, Z., Di, Z., Zhou, D., & Liu, Y. (2025). Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure. 浪淘沙预印本平台. https://doi.org/10.65215/2qt5jb81

Conflict of Interest Statement

The authors declare no conflicts of interest requiring disclosure.