Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics

Matthew Wood, Mathieu Klop, Maxime Allard

TL;DR

Helix-mRNA tackles the challenge of modeling full-length mRNA sequences by combining a hybrid state-space and attention architecture with single-nucleotide encoding and a codon delimiter, enabling long-range dependency capture across UTRs and CDS. It employs a two-stage Warmup-Stable-Decay pre-training regime on a phylogenetically diverse RefSeq corpus to learn general and human-specific representations, and demonstrates superior performance across regulatory and translation-related tasks with a model of only 5.19 million parameters, processing sequences up to 12,288 tokens (about 6x longer than prior methods). The results show robust cross-phyla generalization and improved 5' UTR MRL prediction, highlighting practical impact for designing mRNA vaccines and therapeutics. The work is open-source, providing accessible weights and code for broader adoption.
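To make the Warmup-Stable-Decay idea concrete, here is a minimal sketch of such a learning-rate schedule. The summary does not give the schedule's shape or hyperparameters, so everything here is an assumption made for illustration: the function name `wsd_lr`, the linear warmup and linear decay, and the default values for `peak_lr`, `warmup_frac`, `decay_frac`, and `min_lr`.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           warmup_frac: float = 0.1, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Illustrative Warmup-Stable-Decay schedule (all defaults are assumptions).

    Phase 1 (warmup): the rate rises linearly to peak_lr.
    Phase 2 (stable): the rate is held constant at peak_lr.
    Phase 3 (decay):  the rate falls linearly from peak_lr to min_lr.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup; step 0 already gets a small non-zero rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable plateau at the peak rate.
        return peak_lr
    # Linear decay towards min_lr, clamped once training has finished.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 50, 100, 500, 900, 950, 1000):
        print(s, wsd_lr(s, total_steps=1000))
```

One commonly cited attraction of this kind of schedule is that the long constant plateau does not depend on the total step count, so a run can be extended or branched before the decay phase begins. How the paper's two pre-training stages map onto the phases is not stated in this summary.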

Abstract

mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a hybrid model combining structured state-space and attention layers, to address these challenges. After an initial pre-training stage, a second pre-training stage specialises the model on high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring that biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTR and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
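The tokenization described above, with one token per nucleotide plus a marker that preserves codon boundaries, can be sketched as follows. This is an illustration under stated assumptions and not the released tokenizer: the function name `tokenize_mrna`, the separator symbol `"|"`, and the choice to place the separator after every codon of the coding region only are all hypothetical.

```python
def tokenize_mrna(five_utr: str, cds: str, three_utr: str,
                  codon_sep: str = "|") -> list[str]:
    """Single-nucleotide tokenization with a codon separator (illustrative).

    UTR nucleotides become one token each. In the coding sequence (CDS),
    a separator token (hypothetical symbol "|") follows every codon so the
    reading frame stays visible to the model.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")

    tokens = list(five_utr.upper())      # 5' UTR: one token per nucleotide
    for i in range(0, len(cds), 3):
        tokens.extend(cds[i:i + 3])      # three nucleotide tokens per codon
        tokens.append(codon_sep)         # assumed codon-boundary marker
    tokens.extend(three_utr.upper())     # 3' UTR: one token per nucleotide
    return tokens


if __name__ == "__main__":
    print(tokenize_mrna("GA", "AUGGCC", "UU"))
```

Under this scheme the coding region costs four tokens per codon instead of one, which is part of why a long context window matters for full-length transcripts.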

Paper Structure

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A) Helix-mRNA hybrid architecture incorporating state-based and attention-based approaches. We show how we retain coding region structure with an additional token highlighted in the figure. B) Embeddings from initial pre-training, generated using only coding regions from mRNA sequences not seen during training.
  • Figure 2: Benchmark comparison between Helix-mRNA and Optimus 5-Prime on task-specific fine-tuning to predict Mean Ribosome Load (MRL) across 3 cell lines (HEK293T, T cells, and HepG2) using two replicates, reproduced from the Optimus 5-Prime codebase released with the paper (castillo2024optimus). Results show the $r^2$ values between the predicted MRL and the true MRL.
  • Figure 3: Helix-mRNA embeddings from initial pre-training, generated using full mRNA sequences including both the coding and untranslated regions.