dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu

TL;DR

We address the trade-off between fixed-tokenization efficiency and nucleotide-level biological fidelity in genomic foundation models by introducing dnaHNet, a tokenizer-free autoregressive model with differentiable dynamic chunking. The model learns to compress raw nucleotides into latent tokens through a recursive Encoder–Main Network–Decoder architecture, achieving quadratic FLOP reductions and over $3\times$ faster inference than Transformer-based baselines. Pretrained on 144B nucleotides from 85,205 prokaryotic genomes (GTDB subset), it attains state-of-the-art zero-shot performance on protein variant effect prediction and gene essentiality, while automatically uncovering hierarchical biological structure such as codon triplets and functional regions. These results demonstrate scalable, interpretable genomic modeling with data-efficient training regimes that deviate from standard scaling laws, suggesting strong potential for future genome-scale design tasks and integration with protein-language models.
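
To make the "quadratic FLOP reductions" claim concrete, the following is a minimal sketch under a toy cost model in which the main network's cost grows with the square of its input length, as in full self-attention. The function names, the sequence length, and the compression ratios are hypothetical choices for illustration; encoder and decoder overhead is ignored, so the numbers do not reproduce the reported speedup.

```python
def quadratic_cost(seq_len: int) -> int:
    """Toy cost model: one unit of work per pair of positions."""
    return seq_len * seq_len


def compressed_cost(seq_len: int, ratio: int) -> int:
    """Same toy cost, applied to a sequence shortened by `ratio`."""
    latent_len = seq_len // ratio  # length seen by the main network
    return quadratic_cost(latent_len)


n = 12_000  # hypothetical input length in nucleotides
for ratio in (1, 2, 3, 6):
    saving = quadratic_cost(n) / compressed_cost(n, ratio)
    print(f"compression ratio {ratio}: {saving:.0f}x less pairwise work")
```

Under this simplified model, shortening the sequence by a factor of 3 cuts the pairwise work by a factor of 9, which is the sense in which compression gives a quadratic rather than linear saving.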

Abstract

Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.
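
As an illustration of what compressing nucleotides into latent tokens and expanding them back involves, here is a minimal sketch that uses hard, hand-picked boundaries. It is not the model's differentiable mechanism: in dnaHNet the boundaries are learned, whereas the `chunk` and `upsample` helpers and the codon-aligned boundary mask below are assumptions made for the example.

```python
def chunk(sequence, boundaries):
    """Group items into chunks; boundaries[i] is True where a new chunk starts."""
    chunks = []
    for item, is_start in zip(sequence, boundaries):
        if is_start or not chunks:
            chunks.append([item])  # open a new chunk
        else:
            chunks[-1].append(item)  # extend the current chunk
    return chunks


def upsample(chunk_values, boundaries):
    """Broadcast one value per chunk back to every original position."""
    out = []
    idx = -1
    for i, is_start in enumerate(boundaries):
        if is_start or i == 0:
            idx += 1  # move on to the next chunk's value
        out.append(chunk_values[idx])
    return out


seq = "ATGGCTAAA"
# Hypothetical boundary mask: a new chunk every three nucleotides.
mask = [i % 3 == 0 for i in range(len(seq))]

chunks = chunk(seq, mask)
latent = ["".join(c) for c in chunks]  # stand-in for latent tokens
restored = upsample(latent, mask)  # one latent value per nucleotide position
print(latent)
print(len(restored) == len(seq))
```

The main network would operate on the shorter `latent` sequence, and the upsampling step restores one representation per nucleotide so that next-nucleotide predictions can be made at the original resolution.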

Paper Structure (39 sections, 4 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: dnaHNet Architecture. Raw nucleotide sequences are processed by the Encoder (E), which learns segmentation boundaries via a differentiable chunking mechanism. The compressed latent sequence is modeled by the Main Network (M), then upsampled by the Decoder (D) to produce next-nucleotide predictions. The architecture can be applied recursively for multi-level compression.
  • Figure 2: Inference FLOPs. (Left) Total inference FLOPs versus sequence length. At $10^6$ nucleotides, dnaHNet (218M) requires $3.89\times$ fewer FLOPs than StripedHyena2 (166M). (Right) FLOPs per token across sequence lengths. Hierarchical compression enables dnaHNet to achieve lower per-token costs than both linear-scaling baselines and theoretical $O(n)$ and $O(n^2)$ references.
  • Figure 3: Evaluation perplexity scaling. Evaluation perplexity versus training FLOPs for compute-optimal configurations. dnaHNet achieves a scaling exponent of $\alpha = 0.06$ compared to $\alpha = 0.04$ for StripedHyena2 and $\alpha = 0.01$ for Transformers, demonstrating superior compute efficiency across the tested range.
  • Figure 4: Protein VEP Results. (A) Schematic of the zero-shot scoring method, which uses the language model likelihood of mutated coding sequences to predict experimental fitness. (B) Absolute Spearman correlation on MaveDB benchmarks versus training FLOPs. dnaHNet consistently achieves higher correlation than StripedHyena2 (SH2) and Transformer baselines across all compute budgets.
  • Figure 5: Gene Essentiality Prediction. (A) Schematic of the in silico perturbation task. Gene essentiality is predicted by comparing the wild-type likelihood against that of a variant with inserted premature stop codons. (B) Classification AUROC on DEG versus training FLOPs. Both dnaHNet configurations outperform StripedHyena2, with the (3,2) hierarchy showing the strongest scaling, hypothesized to result from its first layer matching the underlying biological structure of codons.
  • ...and 1 more figure
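
The zero-shot scoring described in the captions for Figures 4 and 5 can be sketched as a comparison of sequence log-likelihoods. The code below is a minimal illustration, not the authors' pipeline: `toy_next_prob` is a hypothetical stand-in for a trained autoregressive model, and replacing whole codons in frame with the stop codon TAA is an assumption about how premature stops might be introduced.

```python
import math

STOP_CODON = "TAA"  # one of the standard stop codons (TAA, TAG, TGA)


def toy_next_prob(prefix: str, nucleotide: str) -> float:
    """Hypothetical model: slightly favours repeating the previous base."""
    if not prefix:
        return 0.25
    return 0.4 if nucleotide == prefix[-1] else 0.2


def log_likelihood(seq: str, next_prob) -> float:
    """Sum of log p(x_t | x_<t) over the sequence."""
    return sum(math.log(next_prob(seq[:t], seq[t])) for t in range(len(seq)))


def with_premature_stops(gene: str, codon_positions) -> str:
    """Replace the codons at the given codon indices with a stop codon."""
    codons = [gene[i:i + 3] for i in range(0, len(gene), 3)]
    for pos in codon_positions:
        codons[pos] = STOP_CODON
    return "".join(codons)


def likelihood_drop(wild_type: str, variant: str, next_prob) -> float:
    """Wild-type log-likelihood minus variant log-likelihood."""
    return log_likelihood(wild_type, next_prob) - log_likelihood(variant, next_prob)


gene = "ATGAAACCCGGG"  # hypothetical coding sequence
variant = with_premature_stops(gene, [1, 2])
score = likelihood_drop(gene, variant, toy_next_prob)
print(variant, score)
```

With a real model in place of `toy_next_prob`, the same `likelihood_drop` function could score a point mutation for variant effect prediction or a stop-codon variant for gene essentiality. How such scores are thresholded or correlated with experimental labels is not specified in this summary.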