TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling

Qirong Yang, Yucheng Guo, Zicheng Liu, Yujie Yang, Qijin Yin, Siyuan Li, Shaomin Ji, Linlin Chao, Xiaoming Zhang, Stan Z. Li

TL;DR

The TrinityDNA model provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications.

Abstract

Modeling genomic sequences presents unique challenges because of their length and structural complexity. Traditional sequence models struggle to capture the long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. We also introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Finally, we introduce a new long-sequence DNA CDS annotation benchmark to make evaluation more comprehensive and more relevant to practical applications.
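To make the strand symmetry mentioned in the abstract concrete, the following minimal sketch computes a reverse complement and mixes forward-strand and reverse-complement features with a sigmoid gate. The abstract does not specify how GRC is implemented, so `gated_fusion`, its arguments, and the scalar per-position features are hypothetical illustrations, not the authors' method.

```python
import math

# Watson-Crick base pairing; N (unknown base) maps to itself.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (case-insensitive input)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(fwd, rc, gate_logits):
    """Hypothetical gate (an assumption, not the paper's GRC):
    per position, mix a forward-strand feature with a
    reverse-complement feature using a sigmoid gate."""
    if not (len(fwd) == len(rc) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for f, r, g in zip(fwd, rc, gate_logits):
        w = sigmoid(g)  # w near 1 favors the forward strand, near 0 the RC strand
        fused.append(w * f + (1.0 - w) * r)
    return fused


if __name__ == "__main__":
    seq = "ATGC"
    print(reverse_complement(seq))
    # A gate logit of 0 weights both strands equally.
    print(gated_fusion([1.0], [3.0], [0.0]))
```

Applying `reverse_complement` twice returns the original sequence, which is the symmetry a strand-aware model can exploit: both strands describe the same molecule, so features from either reading direction are informative.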

Paper Structure

This paper contains 83 sections, 7 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of the TrinityDNA model. (Left) The evolutionary training strategy of TrinityDNA, progressing from prokaryotic DNA to multi-species eukaryotic DNA, and its DNA-targeted long-sequence modeling approach, which addresses structural features such as bidirectional complementarity and major/minor grooves. (Right) Radar chart comparing the zero-shot performance of our models with popular models such as EVO and Caduceus, showing state-of-the-art results.
  • Figure 2: Comparison of log influential scores $\log|\partial y_t/\partial x_s|$ versus distance $(t-s)$ on HG-38 (Nguyen et al., 2023).
  • Figure 3: Average attention entropy of full self-attention models as sequence length increases.
  • Figure 4: Model Architecture of TrinityDNA: The model integrates DNA sequences and structural features by considering its grooves and reverse complementary sequence with shared parameters.
  • Figure 5: Scaling behaviors of our proposed model. (Left) Evaluation perplexity (PPL) against total FLOPs across multiple architectures, showing consistent improvements over various baselines. (Right) Impact of increasing context length (8k, 30k, 100k) on a eukaryotic dataset, where PPL steadily decreases with longer context windows.
  • ...and 5 more figures
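
The attention entropy plotted in Figure 3 can be illustrated with a small sketch. It assumes the standard Shannon entropy of each softmax attention row, averaged over query positions; the exact averaging used in the paper (over heads, layers, or positions) is not stated here, so the function names below are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def average_attention_entropy(score_rows):
    """Mean entropy over query positions (assumed averaging scheme)."""
    return sum(attention_entropy(softmax(r)) for r in score_rows) / len(score_rows)


if __name__ == "__main__":
    # Uniform scores over n keys give the maximum entropy, log(n),
    # so the upper bound on entropy grows with sequence length.
    for n in (8, 64, 512):
        rows = [[0.0] * n for _ in range(4)]
        print(n, average_attention_entropy(rows), math.log(n))
```

Because uniform attention over n keys has entropy log n, the attainable entropy rises with sequence length, so attention can become more diffuse on longer inputs.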