Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models

Rui Zhu, Xiaopu Zhou, Haixu Tang, Stephen W. Scherer, Lucila Ohno-Machado

TL;DR

FOCUS addresses the core bottleneck of ultra-long context in DNA LLMs by introducing a k-mer–aware, trainable compression module that retains only summary KV states across fixed windows, so that the $O(N^2)$ attention cost and the per-token growth of the KV cache give way to near-linear context handling. The method inserts a trainable Focus token after every $k$ bases, uses a shared-boundary windowing scheme, and trains a lightweight adapter to summarize each $k$-mer, so retained memory scales as $O(L/k)$ for a context of $L$ bases. Empirical results on Evo-2 7B show near-lossless fidelity (average per-nucleotide probability change of about $4\times10^{-4}$) and roughly 100× memory reduction, enabling ~80k-token inference windows on a single GPU. The approach is architecture-agnostic and requires no labeled data, offering practical routes to Mb-scale genomic reasoning for tasks such as structural-variant (SV) interpretation and long-range regulatory inference. The work lays groundwork for adaptive windowing and retrieval-aware multi-resolution extensions that could further improve robustness and applicability to whole-genome analyses.
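
As a rough check on the $O(L/k)$ memory claim, the sketch below counts how many summary states would remain per layer if one Focus token stands in for each complete $k$-mer. The function names, and the choice to ignore a trailing partial $k$-mer, are assumptions made for illustration; this is not the authors' implementation.

```python
def retained_focus_states(context_len: int, k: int) -> int:
    """Summary (Focus) states kept per layer for `context_len` bases.

    Assumption: one Focus token per complete k-mer; a trailing partial
    k-mer contributes no Focus token.
    """
    if context_len < 0 or k <= 0:
        raise ValueError("context_len must be >= 0 and k must be > 0")
    return context_len // k


def compression_ratio(context_len: int, k: int) -> float:
    """Ratio of ordinary-token states to retained Focus states."""
    kept = retained_focus_states(context_len, k)
    if kept == 0:
        raise ValueError("context shorter than one k-mer: nothing is retained")
    return context_len / kept


if __name__ == "__main__":
    # 1 kb context, k = 100: 10 summary states, i.e. a 100x reduction
    print(retained_focus_states(1000, 100), compression_ratio(1000, 100))
```

Under these assumptions a 1 kb context with $k = 100$ keeps 10 summary states, which matches the "about 100x" figure quoted in the abstract.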

Abstract

Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
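
The windowed mechanism described in the abstract can be illustrated with a toy simulation: Focus tokens are interleaved after every $k$ bases, the stream is processed in fixed windows, and only the Focus positions are carried forward. This is a simplified sketch, not the paper's method: the `<F>` placeholder, the function names, and measuring the window in tokens (Focus tokens included) are all assumptions, and the shared-boundary scheme and the attention computation itself are not modeled.

```python
from typing import List, Tuple

FOCUS = "<F>"  # hypothetical placeholder for a learnable Focus token


def insert_focus_tokens(bases: str, k: int) -> List[str]:
    """Interleave one Focus token after every k ordinary bases."""
    if k <= 0:
        raise ValueError("k must be > 0")
    out: List[str] = []
    for i, base in enumerate(bases, start=1):
        out.append(base)
        if i % k == 0:
            out.append(FOCUS)
    return out


def simulate_cache(bases: str, k: int, window: int) -> Tuple[List[int], int]:
    """Process the interleaved stream window by window.

    Returns the positions of the retained Focus tokens and the peak number
    of states held at once (carried Focus states + current window).
    Assumption: `window` counts tokens in the interleaved stream.
    """
    if window <= 0:
        raise ValueError("window must be > 0")
    tokens = insert_focus_tokens(bases, k)
    carried: List[int] = []  # stand-ins for retained Focus KV states
    peak = 0
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        peak = max(peak, len(carried) + len(chunk))
        # Ordinary-token states are dropped; only Focus positions survive.
        carried.extend(start + j for j, t in enumerate(chunk) if t == FOCUS)
    return carried, peak


if __name__ == "__main__":
    kept, peak = simulate_cache("ACGTACGTACGT", k=3, window=4)
    print(kept, peak)
```

In this toy run, 12 bases become 16 interleaved tokens, but at most 7 states are held at any time, because each finished window leaves behind only its Focus position.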

Paper Structure

This paper contains 46 sections, 21 equations, 6 figures.

Figures (6)

  • Figure 1: Focus at $k$-mer granularity with sliding windows. Illustrative example with $k{=}3$ and window size $W{=}4$. After every $k$ ordinary bases, a learnable Focus token (gray square) is inserted and, through a dedicated attention module, summarizes the immediately preceding $k$-mer into a compact Focus vector. Generation proceeds in fixed windows ($W$): only Focus states are retained and carried across windows, while ordinary-token states are not kept. Legend: layerwise attention vectors (striped), Focus vector (dotted gray), window (outlined blocks).
  • Figure 2: In-distribution compression fidelity on GRCh38 (excluding Chromosome 1). For Chr2--Chr22, X, and Y, we evaluate Focus--Evo-2 7B against the baseline Evo-2 7B on $500$ random $1024$ bp segments per chromosome. Histograms show the distribution of L1, L2, Hellinger, Jensen--Shannon, and $\mathrm{KL}$ across all positions; annotations mark the median and IQR. Most mass concentrates near zero, indicating high-fidelity compression.
  • Figure 3: In-distribution compression fidelity on OpenGenome2.
  • Figure 4: Out-of-distribution compression fidelity on MSL39 virus sequences.
  • Figure 5: Sequence length vs. L2 discrepancy, and effects of $W$ and $k$. Left: median per-position L2 on GRCh38 for a model trained with $W{=}1024,k{=}100$ (blue) and a retrained model with $W{=}2048$ (red). Right: same protocol with $W{=}1024$ while changing $k$ from $100$ (blue) to $50$ (red).
  • ...and 1 more figure
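
The fidelity figures above compare, position by position, the next-nucleotide distribution of the compressed model with that of the baseline. A minimal sketch of the five named discrepancy measures for two distributions over A, C, G, T is given below. The exact conventions used in the paper (KL direction, logarithm base, any smoothing) are not stated in this summary, so the natural logarithm, the `eps` smoothing term, and the function names here are assumptions.

```python
import math
from typing import Sequence


def l1(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps (an assumption) guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)


def js(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in nats (bounded above by ln 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    baseline = [0.70, 0.10, 0.10, 0.10]    # illustrative values only
    compressed = [0.69, 0.11, 0.10, 0.10]  # illustrative values only
    for name, fn in [("L1", l1), ("L2", l2), ("Hellinger", hellinger),
                     ("JS", js), ("KL", kl)]:
        print(name, fn(baseline, compressed))
```

All five measures are zero when the two distributions coincide, which is why mass concentrated near zero in the histograms indicates that compression leaves the model's predictions largely unchanged.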