Gene42: Long-Range Genomic Foundation Model With Dense Attention
Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, and Shadab Khan
TL;DR
Gene42 is a family of dense-attention, decoder-only genomic foundation models that handle up to 192,000 bp of context at single-nucleotide resolution. Through continuous pretraining starting from 4k-length baselines and a Cardan-inspired extension of RoPE, Gene42 achieves strong perplexity and reconstruction accuracy while delivering state-of-the-art performance across biotype classification, genomic benchmarks, chromatin profiling, and variant pathogenicity tasks. The models use a LLaMA-style architecture with character-level tokenization and are pretrained on GRCh38 alongside multi-species data, enabling robust cross-species generalization. The combination of ultra-long context and dense attention supports precise genomic analyses, with potential impact on precision medicine and large-scale genomic interpretation.
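To make the cost of single-nucleotide resolution concrete, here is a minimal Python sketch of character-level tokenization and of how the number of dense causal attention pairs grows between a 4,096 bp and a 192,000 bp context. The vocabulary, the token ids, the mapping of unknown characters to N, and the function names are illustrative assumptions, not the released Gene42 tokenizer.

```python
# Hypothetical vocabulary: the actual Gene42 token ids are not given in this text.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode(seq: str) -> list[int]:
    """Map each nucleotide to exactly one token id (one token per base pair).

    Characters outside the vocabulary fall back to "N"; this is an assumption.
    """
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def attention_pairs(context_len: int) -> int:
    """Count query-key pairs in one dense causal attention map."""
    return context_len * (context_len + 1) // 2


tokens = encode("acgtnACGT")

# Growth in attention pairs when extending the context from 4,096 to 192,000 tokens.
ratio = attention_pairs(192_000) / attention_pairs(4_096)
print(f"{len(tokens)} tokens, attention pairs grow by about {ratio:.0f}x")
```

Because every base is its own token, a 192,000 bp window is 192,000 tokens long, and under this count each dense attention map holds roughly 2,200 times as many query-key pairs as at 4,096 bp.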
Abstract
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models use a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allows the models to process large-scale genomic data and to capture intricate patterns and dependencies within the human genome. Gene42 is the first dense-attention model capable of handling such long contexts in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks demonstrate state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
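The two pretraining metrics named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: perplexity is taken as the exponential of the mean negative log-likelihood (natural log) of the true next nucleotide, and reconstruction accuracy is assumed to be the fraction of positions where the predicted base matches the reference. The exact definitions used for Gene42 are not given in this text, and the function names are hypothetical.

```python
import math


def perplexity(log_probs: list[float]) -> float:
    """exp(mean negative log-likelihood) over the true next tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))


def reconstruction_accuracy(predicted: str, target: str) -> float:
    """Fraction of positions where the predicted base equals the target base.

    Assumed definition for illustration only.
    """
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


# A model that guesses uniformly over A, C, G, T assigns probability 0.25 each time.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
print(reconstruction_accuracy("ACGT", "ACGA"))
```

A uniform guess over the four nucleotides gives a perplexity of 4, so values well below 4 indicate that a model has learned real structure in the sequence.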
