Sampling-based Continuous Optimization with Coupled Variables for RNA Design

Wei Yu Tang, Ning Dai, Tianshuo Zhou, David H. Mathews, Liang Huang

TL;DR

This work reframes RNA design as a sampling-based continuous optimization problem using coupled-variable distributions to exclude invalid sequences and capture nucleotide dependencies. It introduces two parameterizations (softmax and projection) and unbiased gradient estimation via sampling, scalable to structures up to $400$ nt with beam-pruned LinearPartition for efficiency. On the Eterna100 benchmark, the softmax version optimizing $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ achieves higher arithmetic and geometric means of design quality than baselines, solving more puzzles under MFE and uMFE criteria, particularly for long puzzles. The approach provides a practical, extensible framework with potential applications beyond RNA, including protein design, and highlights avenues for speedups and future model enhancements.

Abstract

The task of RNA design given a target structure aims to find a sequence that can fold into that structure. It is a computationally hard problem where some version(s) have been proven to be NP-hard. As a result, heuristic methods such as local search have been popular for this task, but because they explore only a fixed number of candidates, they cannot keep up with the exponential growth of the design space and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make the approach applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Our work consistently outperforms state-of-the-art methods in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: http://github.com/weiyutang1010/ncrna_design.
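
The loop described in the abstract (a distribution over sequences, sampled objective values, a sampled gradient, and a final candidate picked from the samples) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the softmax-over-logits parameterization, the plain score-function gradient estimator, and in particular `toy_objective`, which is a hypothetical stand-in for a real folding-based objective such as $-\log p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$. Paired positions share one coupled variable over the six valid base pairs, so invalid sequences are never sampled.

```python
import math
import random

UNPAIRED = ["A", "C", "G", "U"]
PAIRED = ["AU", "UA", "CG", "GC", "GU", "UG"]  # the six valid base pairs


def parse_pairs(structure):
    """Return (pairs, unpaired) index lists for a dot-bracket string."""
    stack, pairs, unpaired = [], [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
        else:
            unpaired.append(i)
    return pairs, unpaired


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(structure, variables, logits, rng):
    """Draw one sequence; also return the chosen index for each variable."""
    seq = [""] * len(structure)
    choices = []
    for var, lg in zip(variables, logits):
        alphabet = PAIRED if len(var) == 2 else UNPAIRED
        k = rng.choices(range(len(alphabet)), weights=softmax(lg))[0]
        choices.append(k)
        for pos, nuc in zip(var, alphabet[k]):
            seq[pos] = nuc
    return "".join(seq), choices


def toy_objective(seq):
    """Hypothetical placeholder objective (lower is better): fraction of A/U."""
    return sum(1.0 for nuc in seq if nuc in "AU") / len(seq)


def optimize(structure, objective, steps=50, n_samples=32, lr=1.0, seed=0):
    rng = random.Random(seed)
    pairs, unpaired = parse_pairs(structure)
    # One coupled variable per base pair, one plain variable per unpaired position.
    variables = pairs + [(i,) for i in unpaired]
    logits = [[0.0] * (6 if len(v) == 2 else 4) for v in variables]
    best_seq, best_val = None, float("inf")
    for _ in range(steps):
        grads = [[0.0] * len(lg) for lg in logits]
        for _ in range(n_samples):
            seq, choices = sample_sequence(structure, variables, logits, rng)
            val = objective(seq)
            if val < best_val:  # keep the best sample seen so far
                best_seq, best_val = seq, val
            for g, lg, k in zip(grads, logits, choices):
                probs = softmax(lg)
                for a in range(len(lg)):
                    # Score function: d log p / d logit_a = 1[a == k] - probs[a]
                    g[a] += val * ((a == k) - probs[a]) / n_samples
        for lg, g in zip(logits, grads):
            for a in range(len(lg)):
                lg[a] -= lr * g[a]  # gradient descent step on the logits
    return best_seq, best_val
```

Calling `optimize("((...))", toy_objective)` returns the best sampled sequence and its objective value. Swapping `toy_objective` for a function backed by a real folding engine is the step that would turn this toy into an actual design loop.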

Paper Structure

This paper contains 18 sections, 13 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 5: (a) Results of various RNA Design methods on the Eterna100 dataset. Bold: best value. Underline: second best value. Italic: the byproduct is obtained by evaluating the final solution. This work and SAMFEO take the best solution across the entire optimization trajectory. $\dagger$: geometric mean without 18 undesignable puzzles. (b) -- (c) Comparison between this work and SAMFEO for $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$, grouped by puzzle lengths. Each group contains 10 puzzles, except for the geometric means, which exclude undesignable puzzles. (d) Puzzles solved by this work, SAMFEO, and NEMO under the $\text{uMFE}$ criterion. (e) -- (f) $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ of solutions designed by this work vs. SAMFEO in both original and log scale. Figure \ref{fig:si-main-results} provides similar grouped-by-length and individual plots for other metrics. Starred puzzles: \ref{fig:nemo_examples}, \ref{fig:73}, \ref{fig:76}, \ref{fig:78}, \ref{fig:nemo_examples}, \ref{fig:91}, \ref{fig:learning-curves}, and \ref{fig:99} are hyperlinked to their visualizations.
  • Figure 6: Comparison of the best $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ solutions designed by this work vs. SAMFEO for Puzzle 73 ("Snowflake 4"). (b) -- (c) $\text{MFE}$ structures of the solutions from this work and SAMFEO. Base pairs are colored as follows: blue for correct pairs, red for incorrect pairs, with the intensity indicating pairing probability. Nucleotide colors range from blue to red, indicating positional defect. (d) -- (e) Base-pairing probabilities of this work and SAMFEO. Orange represents missing correct pairs (i.e., correct pairs with a pairing probability below 0.1).
  • Figure 7: Comparison of the best $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ solution designed by this work vs. SAMFEO for Puzzle 78 ("Mat - Lot 2-2 B"). (b) Target structure: pink-filled regions highlight loops that belong to an undesignable motif, while orange base pairs represent the missing pairs in Sampling's $\text{MFE}$ structure. (c) -- (d) $\text{MFE}$ structures of the best $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ solutions from this work and SAMFEO. (e) -- (f) Base-pairing probability plots. Base pairs are colored as follows: blue for correct pairs, red for incorrect pairs, with the intensity indicating pairing probability. Orange represents missing correct pairs (i.e., correct pairs with a pairing probability below 0.1). Nucleotide colors range from blue to red, indicating positional defect. $\tilde{\boldsymbol{y}}$ refers to the target structure with the (orange) base pairs from undesignable motifs removed (i.e., pairs $1$ and $2$ are removed).
  • Figure S1: $p(\boldsymbol{y}^{\star} \mid \boldsymbol{x})$ of solutions designed by this work vs. Matthies et al. (2023) on the 51 shortest structures in Eterna100 (up to 104 nucleotides).
  • Figure S2: (a) -- (c) Average of metrics when puzzles are grouped by length, with each group consisting of 10 puzzles. (d) $\text{NED}(\boldsymbol{x}, \boldsymbol{y}^{\star})$ of solutions designed by this work vs. SAMFEO. Figure \ref{fig:main_result} displays similar grouped-by-length plots and scatterplots for other metrics.
  • ...and 6 more figures