MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li

TL;DR

MergeDNA tackles the challenge of genome-scale sequence modeling by jointly learning a context-aware, differentiable tokenizer and a long-range Transformer. The model combines a Local Encoder with token merging to produce variable-length tokens and a Latent Transformer for global context, with Latent/Local decoders enabling reconstruction. Two pre-training objectives—Merged Token Reconstruction and Adaptive Masked Token Modeling—drive adaptive token granularity and selective masking of informative regions. Across genomic benchmarks and multi-omics tasks, MergeDNA achieves state-of-the-art performance, demonstrating robust generalization across species and modalities and offering a scalable approach to genome-scale representation learning.

Abstract

Modeling genomic sequences faces two unsolved challenges: information density varies widely across regions, and there is no clearly defined minimum vocabulary unit. Existing approaches rely on either the four primitive bases or independently designed DNA tokenizers and, with naive masked language modeling pre-training, often fail to adapt to the varying complexity of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. In terms of network structure, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
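To make the idea of chunking adjacent bases into variable-length words under local-window constraints more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a hard, greedy rule (fuse the single most similar adjacent pair in each non-overlapping window, per layer), cosine similarity as the merge score, and random vectors standing in for learned base embeddings. The abstract describes the actual merging blocks as differentiable and trained jointly with the model, which this sketch does not attempt. The names `merge_layer`, `tokenize`, `window`, and `num_layers` are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def merge_layer(tokens, lengths, window=4):
    """One merging layer (assumed rule): inside each non-overlapping window,
    fuse the most similar pair of adjacent tokens into one token."""
    out_tokens, out_lengths = [], []
    for start in range(0, len(tokens), window):
        win_t = tokens[start:start + window]   # slices copy, so inputs stay intact
        win_l = lengths[start:start + window]
        if len(win_t) > 1:
            # Score every token against its right-hand neighbour.
            sims = [cosine(a, b) for a, b in zip(win_t[:-1], win_t[1:])]
            i = sims.index(max(sims))
            total = win_l[i] + win_l[i + 1]
            # Length-weighted average keeps the merged token an average
            # over all the bases it covers.
            merged = [(x * win_l[i] + y * win_l[i + 1]) / total
                      for x, y in zip(win_t[i], win_t[i + 1])]
            win_t[i:i + 2] = [merged]
            win_l[i:i + 2] = [total]
        out_tokens.extend(win_t)
        out_lengths.extend(win_l)
    return out_tokens, out_lengths


def tokenize(seq, num_layers=2, window=4, dim=8, seed=0):
    """Stack several merging layers on top of per-base embeddings."""
    rng = random.Random(seed)
    # Random stand-in for a learned embedding table (an assumption).
    table = {base: [rng.gauss(0.0, 1.0) for _ in range(dim)] for base in "ACGT"}
    tokens = [list(table[base]) for base in seq]
    lengths = [1] * len(seq)  # every token starts as a single base
    for _ in range(num_layers):
        tokens, lengths = merge_layer(tokens, lengths, window)
    return tokens, lengths


tokens, lengths = tokenize("AAAACGTACGTTTTTT")
print(len(tokens), lengths)
```

With a 16-base input and `window=4`, each layer removes exactly one token per window, so two layers reduce the sequence from 16 tokens to 12 and then to 9, and the recorded lengths always sum back to 16. Which bases end up grouped depends on the embeddings. In a trained model this is what would let token granularity vary with sequence context.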

Paper Structure

This paper contains 53 sections, 8 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Overview of the MergeDNA architecture. The Local Encoder and Decoder perform adaptive DNA tokenization, while the Latent Encoder and Decoder learn contextual information through masked modeling of informative tokens.
  • Figure 2: Pre-training of MergeDNA for (a) Local Encoder & Decoder and (b) Latent Encoder & Decoder.
  • Figure 3: Visualization of token length distributions for (a) BPE (DNABERT-2, ICLR 2024), (b) MxDNA (NeurIPS 2024), and (c) MergeDNA across different genomic contexts. Baseline tokenizers show a static, context-agnostic distribution, while MergeDNA adaptively changes its tokenization strategy based on the sequence type, demonstrating strong context-awareness.