Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models

Yanan Li; Christina Yi Jin; Yuan Jin; Manli Luo; Tie Xu; Shuai Jiao; Wei He; Qing Zhang

TL;DR

This work questions the optimal fusion level between genomic sequences and natural language in DNA–language foundation models. It introduces two approaches: SeqCLIP, which strengthens embedding-level alignment via contrastive pre-training, and OneVocab, which integrates DNA tokens directly into the LLM vocabulary for native token-level interaction. Across NT and KEGG benchmarks, vocabulary-level integration (OneVocab) delivers the strongest and most robust performance, particularly in reasoning tasks, while SeqCLIP improves embedding-based fusion over prior adapters. The findings advocate for native token-level fusion as a more expressive basis for DNA–language reasoning with practical implications for genomic question answering and mechanistic inference.
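To make "integrating DNA tokens directly into the LLM vocabulary" concrete, here is a minimal sketch of one way such an extension could work. It is not the authors' implementation. The `<dna_...>` token format, the choice of k = 3, the non-overlapping split, and the toy text vocabulary are all illustrative assumptions.

```python
from itertools import product

def build_kmer_tokens(k: int) -> list[str]:
    """Enumerate all 4**k DNA k-mers, wrapped in a hypothetical <dna_...> marker
    so they cannot collide with ordinary text tokens."""
    return ["<dna_" + "".join(p) + ">" for p in product("ACGT", repeat=k)]

def extend_vocab(vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Return a copy of vocab with new tokens appended after the existing ids.
    Assumes the existing ids are contiguous, running from 0 to len(vocab) - 1."""
    extended = dict(vocab)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended

def encode_dna(seq: str, k: int, vocab: dict[str, int]) -> list[int]:
    """Split seq into non-overlapping k-mers and map each one to its token id.
    A trailing remainder shorter than k is dropped; characters outside ACGT
    (such as N) raise KeyError in this sketch."""
    ids = []
    for i in range(0, len(seq) - k + 1, k):
        ids.append(vocab["<dna_" + seq[i:i + k].upper() + ">"])
    return ids

# Toy text vocabulary standing in for a real LLM tokenizer (assumption).
text_vocab = {"<pad>": 0, "what": 1, "gene": 2, "is": 3, "this": 4, "?": 5}
vocab = extend_vocab(text_vocab, build_kmer_tokens(3))
ids = encode_dna("ACGTTGCA", 3, vocab)
```

In a real model, each new token id would also need a row in the embedding matrix and the output projection, which would then be trained alongside the text tokens.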

Abstract

Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model's ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion: SeqCLIP, a semantic alignment method, and OneVocab, a vocabulary-level integration method. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, and OneVocab directly integrates genomic $k$-mers into the language model's existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
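The abstract does not spell out the contrastive objective behind SeqCLIP. As a rough illustration, the sketch below implements a generic CLIP-style symmetric contrastive loss over paired DNA and text embeddings. The function names, the temperature of 0.07, and the use of cosine similarity are assumptions borrowed from common contrastive setups, not details confirmed by the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(dna_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th DNA embedding should match the
    i-th text embedding and mismatch every other text in the batch."""
    dna = [l2_normalize(v) for v in dna_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every DNA/text pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(d, t)) / temperature for t in txt]
              for d in dna]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-DNA direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: correctly paired embeddings give a lower loss than shuffled ones.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each DNA embedding toward its paired description and pushes it away from the other descriptions in the batch, which is the sense in which embedding-level alignment is "strengthened".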

Paper Structure (14 sections, 6 equations, 6 figures, 4 tables)

Introduction
Related Work
  Multimodal Fusion Paradigms in MLLMs
  DNA-Language Fusion Models
The Proposed Method
  Problem Formulation
  SeqCLIP: Enhancing DNA-Language Alignment with Contrastive Learning
  OneVocab: Vocabulary-Level Integration
Experiments
  Experimental Settings
  Comparison with State-of-the-Arts
  Ablation Studies
  Probing the Latent Space
Conclusion

Figures (6)

Figure 1: Illustration of the DNA-language reasoning task, where a DNA-language model reasons and produces a biologically meaningful answer to the textual query based on the given DNA sequences.
Figure 2: We systematically investigate three DNA-language fusion strategies, of which the latter two are underexplored in the field. (a) The standard adapter-based architecture, adopted by current DNA-language models. (b) SeqCLIP: explicit semantic alignment of the gene encoder through contrastive learning on massive DNA-text pairs. (c) OneVocab: extends the pre-trained LLM's vocabulary with DNA-specific tokens, allowing the LLM to process them natively.
Figure 3: MCC on the 18 tasks in NT. Our model achieves superior performance compared with others on the vast majority of tasks, while exhibiting more balanced performance across tasks.
Figure 4: Average MCC across 18 NT tasks. Our two methods achieve the best results.
Figure 5: One KEGG reasoning case study (full reasoning steps are provided in the supplement). The red box indicates the ground truth disease; red text marks incorrect answers, and the numbers after each method denote the semantic, logical, and completeness scores.
...and 1 more figure
