Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics
Matthew Wood, Mathieu Klop, Maxime Allard
TL;DR
Helix-mRNA models full-length mRNA sequences by combining a hybrid state-space and attention architecture with single-nucleotide encoding and a codon delimiter, which lets it capture long-range dependencies across the UTRs and the CDS. It is pre-trained with a two-stage Warmup-Stable-Decay regime on a phylogenetically diverse RefSeq corpus to learn both general and human-specific representations. With only $5.19$ million parameters, it outperforms prior methods across regulatory and translation-related tasks while processing sequences of up to $12288$ tokens (about $6\times$ longer than prior methods). The results show robust cross-phyla generalization and improved 5' UTR mean ribosome load (MRL) prediction, which is of practical value for designing mRNA vaccines and therapeutics. The work is open-source, with code and weights publicly available.
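To make the Warmup-Stable-Decay idea concrete, the following is a minimal sketch of a generic learning-rate schedule with that shape: a linear warmup, a constant plateau, and a final decay. It is not taken from the Helix-mRNA code. The function name `wsd_lr`, its parameters, the default values, and the choice of a linear decay are all assumptions made for illustration.

```python
def wsd_lr(step, total_steps, warmup_steps, decay_steps,
           peak_lr=1e-3, min_lr=0.0):
    """Illustrative Warmup-Stable-Decay schedule (hypothetical helper).

    All names and default values are assumptions; the linear decay shape
    is one common choice and may differ from what the authors used.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min((step - decay_start) / max(decay_steps, 1), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 9, 50, 80, 90, 100):
        print(s, wsd_lr(s, total_steps=100, warmup_steps=10, decay_steps=20))
```

In a two-stage setup, one plausible arrangement (again an assumption) is to run such a schedule once per stage, with the second stage using the more specialised data.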
Abstract
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence and the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for these properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization and overlook the UTRs. To address these challenges, we present Helix-mRNA, a hybrid model that combines structured state-space layers with attention. After an initial pre-training stage, a second pre-training stage specialises the model on high-quality data. We tokenize mRNA sequences at single-nucleotide resolution with codon separation, so that the biological and structural information in the original mRNA sequence is not lost. Helix-mRNA outperforms existing methods in analysing both UTR and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
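As an illustration of single-nucleotide tokenization with codon separation, here is a minimal sketch under stated assumptions. It is not the Helix-mRNA tokenizer. The function `tokenize_mrna`, the delimiter symbol `"|"`, and the choices to pass the three regions separately and to delimit only the coding sequence are all hypothetical.

```python
CODON_DELIMITER = "|"  # hypothetical symbol; the real tokenizer may use another


def tokenize_mrna(utr5, cds, utr3, delimiter=CODON_DELIMITER):
    """Split an mRNA into single-nucleotide tokens and mark codon boundaries.

    Assumes the region boundaries are known and that only the coding
    sequence (CDS) receives codon delimiters; both are illustrative choices.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")

    tokens = list(utr5)  # 5' UTR: one token per nucleotide
    for i in range(0, len(cds), 3):
        tokens.extend(cds[i:i + 3])  # the three nucleotides of one codon
        tokens.append(delimiter)     # delimiter closes the codon
    tokens.extend(utr3)  # 3' UTR: one token per nucleotide
    return tokens


if __name__ == "__main__":
    print(tokenize_mrna("GG", "AUGGCC", "A"))
```

Because each nucleotide stays a separate token, the UTRs need no reading frame, while the delimiter still exposes the codon structure of the coding region to the model.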
