Designing RNAs with Language Models

Milan Gautam; Ning Dai; Tianshuo Zhou; Bowen Xie; David Mathews; Liang Huang

TL;DR

This work reframes RNA design as conditional sequence generation by conditioning a decoder-only language model on target dot-bracket structures and enforcing base-pairing constraints with a constrained decoding scheme. Starting from a pretrained decoder, the authors perform a minimal RNA-focused adaptation and train via supervised learning on solver-generated structure–design pairs, followed by reinforcement learning to optimize thermodynamics-based rewards. The approach achieves state-of-the-art ensemble metrics across four RNA-design benchmarks while being substantially faster than per-instance optimization, demonstrating the viability of amortized neural solvers for RNA inverse folding. The release of carefully constructed SL and RL datasets and the demonstrated efficiency gains suggest broad applicability of structure-conditioned LMs for scalable RNA design and related design tasks.
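To make the idea of constrained decoding over a dot-bracket target concrete, the following is a minimal sketch, not the authors' implementation. It assumes the allowed pairs are the canonical Watson-Crick pairs plus the G-U wobble pair, and it replaces the language model with a hypothetical `score` callback (uniform weights by default). All function names here are illustrative.

```python
import random

NUCLEOTIDES = "ACGU"
# Assumption: Watson-Crick pairs plus G-U wobble are the only allowed pairs.
ALLOWED_PARTNERS = {"A": "U", "C": "G", "G": "CU", "U": "AG"}


def pair_table(structure):
    """Map every paired position in a dot-bracket string to its partner index."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return partner


def allowed_tokens(partner, prefix, i):
    """Nucleotides permitted at position i given the already generated prefix."""
    j = partner.get(i)
    if j is None or j > i:
        # Unpaired, or the opening side of a pair: nothing to constrain yet.
        return list(NUCLEOTIDES)
    # Closing side of a pair: must be compatible with the nucleotide at j.
    return list(ALLOWED_PARTNERS[prefix[j]])


def constrained_sample(structure, score=None, rng=None):
    """Sample a sequence left to right, masking tokens that would break a pair.

    `score(prefix, structure, token)` is a hypothetical stand-in for the
    model's unnormalized probability of `token`; uniform if omitted.
    """
    rng = rng or random.Random(0)
    partner = pair_table(structure)
    seq = []
    for i in range(len(structure)):
        options = allowed_tokens(partner, seq, i)
        weights = [score(seq, structure, t) if score else 1.0 for t in options]
        seq.append(rng.choices(options, weights=weights)[0])
    return "".join(seq)


def is_valid(seq, structure):
    """Check that every pair in the structure is formed by an allowed pair."""
    partner = pair_table(structure)
    return len(seq) == len(structure) and all(
        seq[i] in ALLOWED_PARTNERS[seq[j]] for i, j in partner.items()
    )
```

Calling `constrained_sample("((..((...))..))")` returns a 15-nucleotide sequence whose paired positions are always compatible, which is the validity guarantee that masking provides. Whether the sequence actually folds into the target still has to be judged by a thermodynamic model.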

Abstract

RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per-instance heuristics or constraint-based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random-induced structure-sequence pairs, and then use reinforcement learning (RL) to optimize end-to-end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state-of-the-art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task-agnostic alternative to per-instance optimization for RNA design. Our code and data are available at https://github.com/KuNyaa/RNA-Design-LM.

Paper Structure (38 sections, 13 equations, 12 figures, 2 tables)

Preliminaries: RNA Folding and Design
RNA Sequences and Structures
RNA Design Problem
Models and Constrained Decoding
RNA Design as Conditional Seq. Generation
Vocabulary and input.
Constrained Decoding
Constraint rule.
Implementation.
Effect.
Language Model Training
Pretrained Model
Surgery for RNA adaptation.
Why not use existing RNA LMs?
Why not encoder-decoder models?
...and 23 more sections

Figures (12)

Figure 2: We convert a general-domain LLM into an RNA designer by keeping the pretrained transformer backbone and shrinking the input and output layers. The original embedding and LM head are downsized and reinitialized to support RNA tokens.
Figure 3: Workflows for constructing (a) SL and (b) RL training datasets. Here, $d_{\text{min\_norm}}$ denotes $d_{\text{min\_norm}}(\boldsymbol{y}^{\star}, \boldsymbol{\mathcal{Y}}_{\text{test}})$.
Figure 4: RL dataset selection. (a) Removing structures in $\boldsymbol{\mathcal{Y}}_{\text{RL\_raw}}$ that are too close to the Eterna100 test set. (b) Selecting the small subset $\boldsymbol{\mathcal{Y}}_{\text{RL}}$ from $\boldsymbol{\mathcal{Y}}_{\text{RL\_large}}$. (c–d) Decoding results on $\boldsymbol{\mathcal{Y}}_{\text{RL\_large}}$ using RL models trained on $\boldsymbol{\mathcal{Y}}_{\text{RL}}$ and $\boldsymbol{\mathcal{Y}}_{\text{RL\_large}}$.
Figure 5: Supervised learning results of different models trained on $\boldsymbol{\mathcal{YX}}_{\text{train}}$ in terms of best-of-$N$ Boltzmann probability on Eterna100.
Figure 6: Naive decoding vs. constrained decoding on the Union Test Set $\boldsymbol{\mathcal{Y}}_{\text{test}}$ ($10^3$ samples per structure; supervised learning model). Naive decoding becomes increasingly invalid for longer targets, while constrained decoding guarantees validity at a $\sim 30\%$ slowdown.
...and 7 more figures
