Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, Wanli Ouyang

TL;DR

MxDNA introduces a learnable DNA tokenization mechanism that optimizes token units during pretraining, addressing the unique properties of genomic sequences. By integrating a sparse Mixture of Convolution Experts with deformable convolution and cross-attention, the model discovers tokenization that can be discontinuous, overlapping, and ambiguous, while maintaining alignment with input resolution. Empirically, MxDNA achieves state-of-the-art performance on Genomic and Nucleotide Transformer Benchmarks with less pretraining data and demonstrates token-level genomic functional capture through visualization analyses. The approach provides a new perspective on DNA tokenization with potential broad applications and biological insights, albeit with limitations in direct biological validation and long-range task evaluation.

Abstract

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, with the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments explicitly considered. On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. Our MxDNA aims to provide a new perspective on DNA tokenization, potentially offering broad applications in various domains and yielding profound insights.

Paper Structure

This paper contains 60 sections, 13 equations, 5 figures, 11 tables, 3 algorithms.

Figures (5)

  • Figure 1: Evolution of tokenization and ideal properties. Left: The progression from basic tokenization methods to more sophisticated techniques, including the direct but unsuitable transfer of natural-language tokenization to genomic language. Right: The ideal tokenization properties for genomics (Meaningful, Discontinuous, Overlapping, and Ambiguous) outlined in vu2023linguistically, which our MxDNA aims to achieve.
  • Figure 2: Our proposed MxDNA. (Top) Overall pipeline of the MxDNA model: Black arrows indicate pretraining data flow, and red arrows indicate finetuning data flow. The learnt tokenization module tokenizes single nucleotide input into learnt tokens. (Bottom) Illustration of the learnt tokenization module: Meaningful basic units are recognized with a linear scoring layer and non-maximum suppression, embedded through convolution experts (Sec. Recognition), and assembled into final tokens by a deformable convolution (Sec. Assembly). This process ensures meaningful, discontinuous, overlapping, and ambiguous tokenization, addressing the unique properties of genomic data.
  • Figure 3: Tokenization results of MxDNA over two individual forward passes (left) compared to those of traditional rule-based methods (right). A block of the same colour refers to a single token.
  • Figure 4: Distribution of token lengths for BPE (top) and MxDNA (bottom) across different downstream datasets, illustrating the distinct strategy of MxDNA for handling DNA tokenization. For the sake of simplicity, we regard the basic units as tokens for MxDNA.
  • Figure 5: t-SNE visualization of the output embeddings at a token level across different functional sequences of different models, demonstrating MxDNA's unique capability to inherently capture and differentiate genomic functionalities at a token level.
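
The recognition step named in the Figure 2 caption (score candidate basic units, then apply non-maximum suppression) can be illustrated with a small sketch. This is not the authors' implementation: the function name `nms_basic_units`, the start-anchored segments, the greedy selection rule, and the toy scores and kernel sizes are all assumptions made for illustration. In MxDNA the scores would come from a learned scoring layer and each kernel size would correspond to one convolution expert.

```python
from typing import List, Sequence, Tuple

# Hypothetical sketch: greedy 1D non-maximum suppression over candidate units.
# scores[i][e] is an assumed score for a unit of length kernel_sizes[e]
# that starts at position i. Returned units are (start, end, expert_index)
# with end exclusive.
def nms_basic_units(
    scores: Sequence[Sequence[float]],
    kernel_sizes: Sequence[int],
) -> List[Tuple[int, int, int]]:
    n = len(scores)
    candidates = []
    for start, row in enumerate(scores):
        for expert, score in enumerate(row):
            end = start + kernel_sizes[expert]
            if end <= n:  # drop units that would run past the sequence
                candidates.append((score, start, end, expert))

    # Highest score first; Python's sort is stable, so ties keep input order.
    candidates.sort(key=lambda c: -c[0])

    taken = [False] * n
    units = []
    for _, start, end, expert in candidates:
        # Suppress any candidate that overlaps an already selected unit.
        if not any(taken[start:end]):
            for pos in range(start, end):
                taken[pos] = True
            units.append((start, end, expert))

    units.sort()  # report units in sequence order
    return units


# Toy input: 6 nucleotides, two assumed experts with kernel sizes 1 and 3.
toy_scores = [
    [0.1, 0.9],
    [0.2, 0.3],
    [0.1, 0.2],
    [0.8, 0.4],
    [0.5, 0.0],
    [0.6, 0.0],
]
toy_units = nms_basic_units(toy_scores, kernel_sizes=[1, 3])
print(toy_units)
```

In this toy run the length-3 unit at position 0 wins first and suppresses every candidate that overlaps it; the remaining positions are then covered by length-1 units. The sketch covers only the selection of non-overlapping basic units. The discontinuous and overlapping final tokens described in the caption come from the later deformable-convolution assembly step, which is not modelled here.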