Leveraging State Space Models in Long Range Genomics

Matvei Popov, Aymen Kallala, Anirudha Ramesh, Narimane Hennouni, Shivesh Khaitan, Rick Gentry, Alain-Sam Cohen

TL;DR

This work tackles the challenge of modeling ultralong genomic sequences where transformers struggle due to quadratic attention and limited extrapolation. It evaluates two state-space model architectures, Caduceus and Hawk, against a 50M-parameter transformer baseline (NTv2) on the GLRB benchmark, showing that SSMs achieve competitive performance and strong zero-shot extrapolation to much longer contexts, up to 1 million tokens on a single GPU. The findings demonstrate linear memory scaling, robust extrapolation across multiple tasks, and practical feasibility for genome-scale analyses, suggesting SSMs as a scalable alternative for long-range genomic modeling with meaningful biological representations. Overall, the work highlights the potential of SSMs to enable efficient, genome-wide analyses and deeper insights into long-range regulatory interactions and disease mechanisms.

Abstract

Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M-parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
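To make the scaling argument in the abstract concrete, the following back-of-the-envelope sketch compares the memory needed to hold one attention score matrix with the memory of a fixed-size recurrent state. It is an illustration only: it assumes naive attention that materializes the full score matrix for a single head in 16-bit precision (memory-efficient attention kernels avoid storing this matrix, although the compute remains quadratic), and the function names and the state dimension of 4096 are hypothetical, not values from the paper.

```python
def attention_score_bytes(seq_len, bytes_per_value=2):
    """Bytes for one seq_len x seq_len score matrix (one head, naive attention)."""
    return seq_len * seq_len * bytes_per_value


def recurrent_state_bytes(state_dim, bytes_per_value=2):
    """Bytes for a fixed-size recurrent state; independent of sequence length."""
    return state_dim * bytes_per_value


# Fine-tuning length (12 kbp, treated here as 12,000 tokens) vs. 1M tokens.
for seq_len in (12_000, 1_000_000):
    attn = attention_score_bytes(seq_len)
    state = recurrent_state_bytes(state_dim=4096)  # hypothetical state size
    print(f"{seq_len:>9} tokens: scores {attn:,} B, recurrent state {state:,} B")
```

Under these assumptions the score matrix grows from 288 MB at 12,000 tokens to 2 TB at 1M tokens, while the recurrent state stays at the same few kilobytes.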

Paper Structure

This paper contains 19 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Comparison of the extrapolation behavior of state-space models and attention-based models on VEP eQTLs (AUROC). For NTv2, we also report an inference-time extrapolation method: position interpolation. A dotted vertical line indicates the fine-tuning sequence length (12 kbp) of all models. Attention-based models collapse when processing sequences longer than those encountered at training time, whereas state-space models generalize to sequences up to 10x longer. Dotted line segments mark values that we could not compute because of computational cost; they are assumed based on the observed trends.
  • Figure 2: Zero-shot extrapolation results of state space models across the 6 tasks of the GLRB. Dotted vertical lines indicate the fine-tuning sequence length (12 kbp).
  • Figure 3: Visualization of hidden-state propagation in SSMs for ultralong sequences. An ultralong sequence is split into multiple chunks, which are processed as a linear scan over chunks, with the hidden state carried from one chunk to the next. The chunk size can be set to any value that fits on a single GPU. The hidden state's size always stays fixed.
  • Figure 4: Zero-shot extrapolation on VEP ClinVar and VEP eQTL with Hawk (50M parameters) up to 1 Mbp input length. Performance remains stable despite the substantial increase in context size, indicating strong scalability.
  • Figure 5: MLM validation loss as a function of training step for different sequence lengths (2 kbp, 4 kbp, 8 kbp, 12 kbp, 24 kbp, 48 kbp, 96 kbp, and 128 kbp). A dashed vertical line indicates the point at which training with 12 kbp sequences begins (the 140,000th training step). The y-axis shows the MLM loss, while the x-axis denotes training steps. Although the model continues training on a fixed 12 kbp context after this point, we measure validation loss across multiple lengths to assess generalization and extrapolation. The curves remain closely clustered, indicating that the model maintains comparable loss values even as the sequence length changes significantly.
  • ...and 1 more figure
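The chunked processing described in the Figure 3 caption can be sketched with a toy recurrence. This is a minimal illustration, not the Caduceus or Hawk implementation: it assumes a simple diagonal linear recurrence h_t = a * h_{t-1} + b * x_t with a sum readout, and the names `scan_chunk`, `chunked_scan`, `decay`, and `gain` are hypothetical. The point it shows is that carrying a fixed-size state across chunk boundaries gives the same result as scanning the whole sequence at once.

```python
def scan_chunk(chunk, state, decay, gain):
    """Scan one chunk with h_t = decay * h_{t-1} + gain * x_t (elementwise).

    Returns the per-step outputs and the final state to hand to the next chunk.
    """
    outputs = []
    for x in chunk:
        state = [a * h + b * x for a, h, b in zip(decay, state, gain)]
        outputs.append(sum(state))  # toy readout: sum of the state entries
    return outputs, state


def chunked_scan(sequence, decay, gain, chunk_size):
    """Process a long sequence chunk by chunk, propagating only the state."""
    state = [0.0] * len(decay)  # fixed-size state, independent of sequence length
    outputs = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        chunk_out, state = scan_chunk(chunk, state, decay, gain)
        outputs.extend(chunk_out)
    return outputs, state


# Usage: chunked processing matches a single pass over the full sequence.
sequence = [float(i % 7) for i in range(1000)]
decay = [0.9, 0.5, 0.99]
gain = [1.0, 0.3, -0.2]
full_out, full_state = scan_chunk(sequence, [0.0] * 3, decay, gain)
chunk_out, chunk_state = chunked_scan(sequence, decay, gain, chunk_size=128)
assert chunk_out == full_out and chunk_state == full_state
```

Only one chunk and the state are live at any time in this sketch, so the working set is bounded by the chosen chunk size rather than by the total sequence length.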