Leveraging State Space Models in Long Range Genomics
Matvei Popov, Aymen Kallala, Anirudha Ramesh, Narimane Hennouni, Shivesh Khaitan, Rick Gentry, Alain-Sam Cohen
TL;DR
This work tackles the challenge of modeling ultra-long genomic sequences, where transformers struggle because attention scales quadratically with sequence length and extrapolates poorly beyond the training context. It evaluates two state-space model (SSM) architectures, Caduceus and Hawk, against a 50M-parameter transformer baseline (NTv2) on the GLRB benchmark. The SSMs achieve competitive performance, extrapolate zero-shot to contexts 10 to 100 times longer than those seen in training, and process sequences of up to 1 million tokens on a single GPU. The findings demonstrate linear memory scaling, robust extrapolation across multiple tasks, and practical feasibility for genome-scale analyses. They suggest that SSMs are a scalable alternative for long-range genomic modeling that could support genome-wide study of long-range regulatory interactions and disease mechanisms.
Abstract
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models excel at short-context tasks but are limited by the quadratic computational complexity of the attention module and by their inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions comparable to those of a 50M-parameter transformer baseline. We find that SSMs match transformer performance and exhibit strong zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training. This indicates more generalizable representations that are better suited to modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, which allows entire genomic regions to be modeled at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.
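As a rough illustration of the scaling argument in the abstract, the back-of-the-envelope sketch below compares the memory needed to hold one fully materialized attention score matrix, which grows quadratically with sequence length, with per-token activations, which grow linearly. It is not taken from the paper: the function names, the 2-byte (16-bit) value size, and the hidden width of 512 are all assumptions chosen for illustration, and the numbers cover a single head and a single layer only.

```python
def attention_scores_bytes(seq_len, bytes_per_value=2, heads=1):
    """Bytes for one layer's fully materialized (seq_len x seq_len) score matrix.

    Hypothetical helper: assumes naive attention that stores every score.
    """
    return heads * seq_len * seq_len * bytes_per_value


def activation_bytes(seq_len, hidden_dim, bytes_per_value=2):
    """Bytes for one (seq_len x hidden_dim) activation tensor.

    Hypothetical helper: this is the linear-in-length term that a
    recurrent or SSM-style layer also has to store.
    """
    return seq_len * hidden_dim * bytes_per_value


if __name__ == "__main__":
    hidden_dim = 512  # assumed width, not a value reported in the paper
    for seq_len in (1_000, 10_000, 100_000, 1_000_000):
        quadratic = attention_scores_bytes(seq_len)
        linear = activation_bytes(seq_len, hidden_dim)
        print(
            f"{seq_len:>9} tokens: "
            f"score matrix {quadratic / 1e9:12.3f} GB, "
            f"activations {linear / 1e9:8.3f} GB"
        )
```

Under these assumptions, a single 16-bit score matrix at 1M tokens takes 2e12 bytes (2 TB), while the activation tensor takes about 1 GB. Memory-efficient attention kernels avoid storing the full score matrix, so in practice the quadratic cost appears mainly as compute rather than memory, but the gap in growth rate is the same.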
