Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models

Rui Zhu; Xiaopu Zhou; Haixu Tang; Stephen W. Scherer; Lucila Ohno-Machado

TL;DR

FOCUS targets the core bottleneck of ultra-long context in DNA LLMs: the $O(N^2)$ cost of attention and the linear growth of the KV cache. It introduces a k-mer–aware, trainable compression module that retains only summary KV states across fixed windows, which brings context handling to near-linear cost. The method inserts trainable Focus tokens after every $k$ bases, uses a shared-boundary window, and trains a lightweight adapter to summarize each $k$-mer, so that retained memory scales as $O(L/k)$ for a sequence of length $L$. Empirical results on Evo-2 7B show near-lossless fidelity (average per-nucleotide probability change of about $4 \times 10^{-4}$) and roughly 100× memory reduction, enabling ~80k-token inference windows on a single GPU. The approach is architecture-agnostic and requires no labeled data, offering a practical route to Mb-scale genomic reasoning for tasks such as structural variant (SV) interpretation and long-range regulatory inference. The work lays groundwork for adaptive windowing and retrieval-aware multi-resolution extensions to further improve robustness and applicability to whole-genome analyses.
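The token-insertion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder symbol `<F>`, the function names, and the character-level tokenization are all assumptions made for illustration. It only shows how placing one summary token after every $k$ bases yields roughly $L/k$ summary positions for a sequence of length $L$.

```python
# Illustrative sketch only: names and the "<F>" symbol are hypothetical,
# and one token per base is assumed.
FOCUS_TOKEN = "<F>"


def insert_focus_tokens(seq: str, k: int) -> list[str]:
    """Return a token list with FOCUS_TOKEN placed after every k bases."""
    if k <= 0:
        raise ValueError("k must be positive")
    tokens: list[str] = []
    for i, base in enumerate(seq, start=1):
        tokens.append(base)
        if i % k == 0:
            # A complete k-mer has just been emitted; add its summary slot.
            tokens.append(FOCUS_TOKEN)
    return tokens


def num_summary_tokens(seq_len: int, k: int) -> int:
    """Number of summary tokens produced for seq_len bases (complete k-mers only)."""
    return seq_len // k


example = insert_focus_tokens("ACGTAC", k=3)
```

With `k=3`, the six-base example gains two summary slots, one per complete 3-mer. In this sketch a trailing partial $k$-mer receives no summary token; how the actual method treats that case is not stated here.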

Abstract

Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100×) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from $O(N^2)$ to near-linear $O(N)$, enabling about 100× longer inference windows on commodity GPUs with near-lossless fidelity.
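The roughly 100× figure follows from simple arithmetic on KV-cache size, sketched below. The model dimensions (`n_layers`, `n_kv_heads`, `head_dim`, 2-byte elements) are hypothetical values chosen for illustration, not the configuration of Evo-2 or of FOCUS. The sketch also assumes that only summary-token KV is kept for past context, and it ignores the ordinary-token KV of the window currently being processed.

```python
# Back-of-the-envelope KV-cache arithmetic. All dimensions are hypothetical
# illustration values, not taken from the paper.

def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens positions."""
    # The factor of 2 accounts for storing both a key and a value vector.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem


context_len = 1000              # a 1 kb context, one token per base (assumed)
k = 100                         # one summary token per 100 bases
n_summary = context_len // k    # 10 summary tokens, as in the abstract

dims = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # hypothetical model

full_bytes = kv_cache_bytes(context_len, **dims)
compressed_bytes = kv_cache_bytes(n_summary, **dims)
ratio = full_bytes // compressed_bytes

print(f"full: {full_bytes} B, compressed: {compressed_bytes} B, ratio: {ratio}x")
```

Because cache size is linear in the number of cached positions, the ratio depends only on `context_len / n_summary` and not on the assumed model dimensions.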
