Sampling-based Continuous Optimization for Messenger RNA Design

Feipeng Yue, Ning Dai, Wei Yu Tang, Tianshuo Zhou, David H. Mathews, Liang Huang

TL;DR

This work proposes a general sampling-based continuous optimization framework, inspired by SamplingDesign, that iteratively samples candidate synonymous sequences, evaluates them with black-box metrics, and updates a parameterized sampling distribution.

Abstract

Designing messenger RNA (mRNA) sequences for a fixed target protein requires searching an exponentially large synonymous space while optimizing properties that affect stability and downstream performance. This is challenging because practical mRNA design involves multiple coupled objectives beyond classical folding criteria, and different applications prefer different trade-offs. We propose a general sampling-based continuous optimization framework, inspired by SamplingDesign, that iteratively samples candidate synonymous sequences, evaluates them with black-box metrics, and updates a parameterized sampling distribution. Across a diverse UniProt protein set and the SARS-CoV-2 spike protein, our method consistently improves the chosen objective, with particularly strong gains on average unpaired probability and accessible uridine percentage compared to LinearDesign and EnsembleDesign. Moreover, our multi-objective COMBO formulation enables weight-controlled exploration of the design space and naturally extends to incorporate additional computable metrics.
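
The sample, evaluate, update loop described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation. It assumes an independent softmax over synonymous codons at each position (rather than the lattice-based distribution used in the paper), a score-function gradient estimate with a batch-mean baseline, a truncated hypothetical codon table, and a toy black-box metric (`u_fraction`, the fraction of U nucleotides) standing in for the folding-based metrics. All function names and parameters are illustrative.

```python
import numpy as np

# Hypothetical, truncated codon table for illustration only (RNA alphabet).
SYNONYMS = {
    "M": ["AUG"],
    "F": ["UUU", "UUC"],
    "L": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "K": ["AAA", "AAG"],
}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def u_fraction(seq):
    """Toy black-box metric (an assumption): fraction of U nucleotides."""
    return seq.count("U") / len(seq)

def optimize(protein, metric, steps=300, batch=32, lr=2.0, seed=0):
    """Minimize `metric` over synonymous sequences by updating codon logits."""
    rng = np.random.default_rng(seed)
    # One logit vector per amino-acid position; zeros give a uniform start.
    logits = [np.zeros(len(SYNONYMS[a])) for a in protein]
    for _ in range(steps):
        probs = [softmax(z) for z in logits]
        # Sample: draw one codon index per position for each batch member.
        choices = np.array(
            [[rng.choice(len(p), p=p) for p in probs] for _ in range(batch)]
        )
        seqs = [
            "".join(SYNONYMS[a][c] for a, c in zip(protein, row))
            for row in choices
        ]
        # Evaluate: the metric is only ever called as a black box.
        scores = np.array([metric(s) for s in seqs])
        advantage = scores - scores.mean()  # batch-mean baseline
        # Update: score-function gradient, d log p(c) / d logits = onehot(c) - p.
        for i, p in enumerate(probs):
            grad = np.zeros_like(p)
            for row, adv in zip(choices, advantage):
                onehot = np.zeros_like(p)
                onehot[row[i]] = 1.0
                grad += adv * (onehot - p)
            logits[i] -= lr * grad / batch  # descent step, since we minimize
    return [softmax(z) for z in logits]

if __name__ == "__main__":
    final = optimize("MFLK", u_fraction)
    for aa, p in zip("MFLK", final):
        print(aa, dict(zip(SYNONYMS[aa], np.round(p, 3))))
```

Because the metric is treated as a black box, swapping `u_fraction` for any other computable sequence score requires no change to the update rule; only the sampling distribution is differentiated.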

Paper Structure (50 sections, 31 equations, 4 figures, 1 algorithm)


Figures (4)

  • Figure 1: The parameterized sampling lattice and an illustrative update--sample--evaluate iteration. (a--b) Parameterized sampling lattice. To avoid enumerating the exponentially large synonymous space for a fixed protein $\boldsymbol{p}$, we represent all valid coding sequences as a DFA-based lattice, where each complete path corresponds to a synonymous mRNA $\boldsymbol{x}$. We then equip the lattice with probabilistic parameters, forming a pDFA in which each state defines a locally normalized distribution over its outgoing edges (Eq. \ref{eq:local_norm}). Edge labels are nucleotides, and sampling generates an mRNA by traversing the lattice and concatenating the emitted labels. Illustrative workflow. Starting from an initialized lattice (a), we sample candidate mRNAs and evaluate them under the chosen objective (here, minimizing accessible U%). Using these scores, we perform a gradient update on the lattice parameters, yielding an updated probabilistic lattice (b), from which we sample and evaluate again. After the update, the sampled sequences exhibit a clear reduction in accessible U% (highlighted by the cyan regions in the table), and the corresponding decision region shifts probability mass away from U-emitting branches (highlighted by the cyan regions in the pDFA), consistent with reducing U-rich choices as one effective route to lowering accessible U%. This update--sample--evaluate loop is repeated until the optimization metrics converge.
  • Figure 2: Single-metric optimization on UniProt proteins and the SARS-CoV-2 spike. (a--c) Metric change relative to LinearDesign (baseline $=0$) versus protein length: (a) $\Delta\Delta G^{\circ}_{\mathrm{ens}}$ ($\mathrm{EFE}$), (b) $\Delta\mathrm{AUP}$, and (c) $\Delta\mathrm{AccessU}$. In each subfigure, the red line is the LinearDesign baseline, the orange curve is EnsembleDesign minus LinearDesign, and the green curve is Ours minus LinearDesign. Table \ref{tab:single} gives the detailed breakdown: LinearDesign reports the metric values of its sequence; EnsembleDesign and Ours report differences relative to LinearDesign.
  • Figure 3: SARS-CoV-2 spike trajectories for single-metric optimization. Panels (a)--(c) correspond to optimizing $\mathrm{EFE}$, $\mathrm{AUP}$, and $\mathrm{AccessU}$ on the spike protein, respectively. Each panel contains trajectories of sampled-sequence statistics over iterations. Subplots with blue backgrounds indicate the primary optimized metric. In a primary-metric subplot, the blue curve denotes the batch mean over sampled sequences, and the orange curve denotes the value of the best sampled sequence under the corresponding optimization objective. In the remaining subplots, the orange curve reports the other metric values of that same best-by-objective sampled sequence, and the cyan curve denotes the best sampled value of that metric within each batch.
  • Figure 4: Comparison of COMBO optimization on SARS-CoV-2 spike with prior designs in the LinearDesign design space. The spike mRNA design space is visualized in four dimensions: minimum free energy ($\mathrm{MFE}$; x-axis) and codon adaptation index ($\mathrm{CAI}$; y-axis), with point color encoding $\mathrm{AUP}$ (percent; colorbar) and circle size encoding $\mathrm{AccessU}$ (percent; size legend). The gray curve is the feasibility limit (optimal boundary) computed by LinearDesign by varying the codon-optimality weight $\lambda$ from $0$ to $\infty$. Points A--G are spike sequences designed by LinearDesign as suboptimal candidates, and H is a codon-optimized baseline designed by OptimumGene. Four reference SARS-CoV-2 spike mRNA sequences are annotated for comparison: Wildtype, BNT-162b2, mRNA-1273 (Moderna), and CV2CoV. $\mathrm{COMBO}$ results are shown as points labeled by $(\alpha,\beta,\delta)$, where $(\alpha,\beta,\gamma,\delta)$ satisfy $\alpha,\beta,\gamma,\delta\ge0$ and $\alpha+\beta+\gamma+\delta=1$ (thus $\gamma=1-\alpha-\beta-\delta$).
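
The weight constraint in the Figure 4 caption can be made concrete with a small sketch: three weights are chosen freely, the fourth is determined by the simplex constraint, and the combined objective is a weighted sum. The caption does not say which weight pairs with which metric, nor how metrics are scaled or signed, so the pairing and the assumption that each value is already normalized with lower being better are hypothetical, as are the function names.

```python
def complete_weights(alpha, beta, delta, tol=1e-12):
    """Return (alpha, beta, gamma, delta) with gamma = 1 - alpha - beta - delta.

    Raises ValueError if any weight would be negative, i.e. the point
    lies outside the probability simplex described in the caption.
    """
    gamma = 1.0 - alpha - beta - delta
    if min(alpha, beta, delta) < 0 or gamma < -tol:
        raise ValueError("weights must be non-negative and sum to 1")
    return (alpha, beta, max(gamma, 0.0), delta)

def combo_score(values, weights):
    """Weighted sum of four metric values.

    Assumption: `values` are pre-normalized and ordered to match
    `weights`; the actual metric-to-weight pairing is not given here.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, values))

if __name__ == "__main__":
    w = complete_weights(0.2, 0.3, 0.1)
    print(w, combo_score((1.0, 2.0, 3.0, 4.0), w))
```

Sweeping `(alpha, beta, delta)` over a grid of valid points is one way to obtain the kind of weight-labeled scatter the caption describes.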