Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis

Mohammadsaleh Refahi, Mahdi Abavisani, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen

TL;DR

CARMANIA introduces a context-aware pretraining framework that augments next-token prediction with a transition-matrix (TM) loss and uses sliding-window attention to model long genomic sequences up to 160 kbp. By aligning local token transitions with empirically derived bigram statistics via a KL-based TM loss, the approach enforces global sequence consistency while maintaining computational efficiency. The method achieves consistent improvements across 40 genomic tasks, including notable gains in enhancer MCC and AMR classification, and demonstrates robust long-range retention and domain adaptation. This work advances practical, scalable Transformer-based genomic modeling by integrating Markovian priors with efficient attention, enabling improved interpretation and prediction in diverse biological contexts.

Abstract

Transformers have revolutionized nucleotide sequence analysis, yet capturing long-range dependencies remains challenging. Recent studies show that autoregressive transformers often exhibit Markovian behavior by relying on fixed-length context windows for next-token prediction. However, standard self-attention mechanisms are computationally inefficient for long sequences due to their quadratic complexity and do not explicitly enforce global transition consistency. We introduce CARMANIA (Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis), a self-supervised pretraining framework that augments next-token (NT) prediction with a transition-matrix (TM) loss. The TM loss aligns predicted token transitions with empirically derived n-gram statistics from each input sequence, encouraging the model to capture higher-order dependencies beyond local context. This integration enables CARMANIA to learn organism-specific sequence structures that reflect both evolutionary constraints and functional organization. We evaluate CARMANIA across diverse genomic tasks, including regulatory element prediction, functional gene classification, taxonomic inference, antimicrobial resistance detection, and biosynthetic gene cluster classification. CARMANIA outperforms the previous best long-context model by at least 7 percent, matches state-of-the-art on shorter sequences (exceeding prior results on 20 out of 40 tasks while running approximately 2.5 times faster), and shows particularly strong improvements on enhancer and housekeeping gene classification tasks, including up to a 34 percent absolute gain in Matthews correlation coefficient (MCC) for enhancer prediction. The TM loss boosts accuracy in 33 of 40 tasks, especially where local motifs or regulatory patterns drive prediction.
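
To make the TM loss concrete, the following is a minimal sketch of a KL-based transition-matrix term for the bigram case over the four nucleotides. It is an illustration under stated assumptions, not the authors' implementation: the function names (`empirical_transition_matrix`, `predicted_transition_matrix`, `tm_loss`), the smoothing constant `eps`, and the choice to build the predicted matrix by summing next-token distributions grouped by the current token are all hypothetical. The exact formulation used in the paper is not given in this excerpt.

```python
import numpy as np

VOCAB = "ACGT"  # assumed single-nucleotide vocabulary for illustration


def empirical_transition_matrix(seq, eps=1e-8):
    """Row-normalized bigram counts P[a, b] = freq(b follows a) in seq."""
    idx = np.array([VOCAB.index(c) for c in seq])
    counts = np.zeros((4, 4))
    # Accumulate one count per adjacent (current, next) pair.
    np.add.at(counts, (idx[:-1], idx[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # eps smoothing keeps rows valid even for tokens that never occur.
    return (counts + eps) / (row_sums + 4 * eps)


def predicted_transition_matrix(seq, probs, eps=1e-8):
    """Aggregate model next-token distributions by the current token.

    probs has shape (len(seq) - 1, 4); probs[t] is the predicted
    distribution over the token at position t + 1.
    """
    idx = np.array([VOCAB.index(c) for c in seq[:-1]])
    sums = np.zeros((4, 4))
    # Add each predicted distribution to the row of its current token.
    np.add.at(sums, idx, probs)
    row_sums = sums.sum(axis=1, keepdims=True)
    return (sums + eps) / (row_sums + 4 * eps)


def tm_loss(p_emp, q_pred):
    """Sum over rows of KL(p_emp[a, :] || q_pred[a, :])."""
    return float(np.sum(p_emp * np.log(p_emp / q_pred)))


# Hypothetical usage: compare a uniform predictor with a perfect one.
seq = "ACGTACGTAC"
p_emp = empirical_transition_matrix(seq)
uniform = np.full((len(seq) - 1, 4), 0.25)
one_hot = np.eye(4)[[VOCAB.index(c) for c in seq[1:]]]
loss_uniform = tm_loss(p_emp, predicted_transition_matrix(seq, uniform))
loss_perfect = tm_loss(p_emp, predicted_transition_matrix(seq, one_hot))
print(loss_uniform, loss_perfect)
```

In a training setup, a term like this would presumably be added to the next-token loss with a weight; the Figure 2 caption contrasts $\beta=1$ with $\beta=0$, which suggests $\beta$ plays that role, though the precise combination is an assumption here.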

Paper Structure

This paper contains 32 sections, 15 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Proposed Pretraining Framework. We extend a LLaMA-style decoder with a transition matrix module to capture global nucleotide co-occurrence. The model uses sliding-window attention with rotary embeddings and local caching, reducing attention complexity from $O(n^2)$ to $O(n)$. The transition matrix complements local attention by preserving long-range dependencies efficiently.
  • Figure 2: Comparison of training losses with and without explicit TM loss enforcement ($\beta=1$ vs. $\beta=0$). Left: Both models show similar reductions in next-token prediction loss. Right: Without TM loss ($\beta=0$), the model partially learns transition structure—TM loss rises then stabilizes—indicating implicit alignment with n-gram patterns.
  • Figure 3: Left: Effect of window size on model performance. A window size of 128 achieves results comparable to full attention. Right: Inference time per sequence for CARMANIA (83M), HyenaDNA (1.6M), and Caduceus-PH (1.9M) across varying sequence lengths.
  • Figure 4: Heatmap of average sequence similarity across 50 independent 160 kbp genomic segments in 100 bp windows at 2000 bp intervals. Left: CARMANIA consistently maintains high similarity across all regions, demonstrating superior long-range memory, whereas HyenaDNA shows reduced sequence coherence in later segments. Right: Incorporating the TM loss improves memory retention in CARMANIA, yielding higher sequence similarity across extended contexts.
  • Figure 5: t-SNE visualization of the 10 most common genes in the Scorpio-Gene-Taxa dataset. CARMANIA effectively clusters genes while maintaining taxonomic coherence, leading to superior gene-to-taxonomy classification performance.
  • ...and 3 more figures
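
The Figure 1 caption states that sliding-window attention reduces attention cost from $O(n^2)$ to $O(n)$. A small sketch can show why: with a fixed window, each position attends to at most `window` earlier positions, so the number of query-key pairs grows linearly in sequence length. The helper names `sliding_window_causal_mask` and `attended_pairs` are hypothetical, and the dense boolean mask is used only for counting; an efficient implementation would not materialize the full $n \times n$ matrix.

```python
import numpy as np


def sliding_window_causal_mask(n, window):
    """Boolean (n, n) mask: query i may attend to keys j with i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)


def attended_pairs(n, window):
    """Number of (query, key) pairs that the mask allows."""
    return int(sliding_window_causal_mask(n, window).sum())


# Hypothetical usage: doubling n roughly doubles the windowed cost,
# while the full causal cost (window = n) roughly quadruples.
for n in (256, 512, 1024):
    print(n, attended_pairs(n, 128), attended_pairs(n, n))
```

For `n >= window`, the windowed count is `n * window - window * (window - 1) / 2`, which is linear in `n` for a fixed window such as the size of 128 reported in Figure 3.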