Sampling-based Continuous Optimization with Coupled Variables for RNA Design

Wei Yu Tang; Ning Dai; Tianshuo Zhou; David H. Mathews; Liang Huang

TL;DR

This work reframes RNA design as a sampling-based continuous optimization problem using coupled-variable distributions to exclude invalid sequences and capture nucleotide dependencies. It introduces two parameterizations (softmax and projection) and unbiased gradient estimation via sampling, scalable to structures up to $400$ nt with beam-pruned LinearPartition for efficiency. On the Eterna100 benchmark, the softmax version optimizing $p(\boldsymbol{y}^* \,|\, \boldsymbol{x})$ achieves higher arithmetic and geometric means of design quality than baselines, solving more puzzles under MFE and uMFE criteria, particularly for long puzzles. The approach provides a practical, extensible framework with potential applications beyond RNA, including protein design, and highlights avenues for speedups and future model enhancements.

Abstract

Given a target structure, the task of RNA design is to find a sequence that folds into that structure. It is a computationally hard problem, and some versions have been proven NP-hard. As a result, heuristic methods such as local search have been popular for this task, but because they explore only a fixed number of candidates, they cannot keep up with the exponential growth of the design space and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make the approach applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Our method consistently outperforms state-of-the-art methods in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: http://github.com/weiyutang1010/ncrna_design.
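
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a softmax-parameterized product distribution with one 4-way variable per unpaired position and one coupled 6-way variable per base pair (CG, GC, AU, UA, GU, UG), so every sample is valid for the target structure. The gradient of the expected objective is estimated from samples with the standard score-function identity, which may differ in detail from the estimator used in the paper. All function names are hypothetical, and `toy_gc_fraction` is a stand-in objective; a real objective such as $p(\boldsymbol{y}^* \,|\, \boldsymbol{x})$ would require a folding engine. Because the toy objective is maximized, the update is gradient ascent; minimizing an objective flips the sign.

```python
import numpy as np

NUCS = ["A", "C", "G", "U"]                    # choices for an unpaired position
PAIRS = ["CG", "GC", "AU", "UA", "GU", "UG"]   # coupled choices for a base pair

def parse_pairs(structure):
    """Split a dot-bracket string into unpaired positions and (i, j) base pairs."""
    stack, pairs, unpaired = [], [], []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.append((stack.pop(), j))
        else:
            unpaired.append(j)
    return unpaired, pairs

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def sample_indices(probs, k, rng):
    """Draw k independent category indices for each row of probs."""
    n, m = probs.shape
    idx = np.zeros((k, n), dtype=int)
    for a in range(n):
        idx[:, a] = rng.choice(m, size=k, p=probs[a])
    return idx

def decode(structure, idx_u_row, idx_p_row):
    """Turn sampled indices into a sequence; paired positions are filled jointly."""
    unpaired, pairs = parse_pairs(structure)
    seq = [""] * len(structure)
    for a, pos in enumerate(unpaired):
        seq[pos] = NUCS[idx_u_row[a]]
    for b, (i, j) in enumerate(pairs):
        seq[i], seq[j] = PAIRS[idx_p_row[b]]
    return "".join(seq)

def toy_gc_fraction(seq):
    """Hypothetical stand-in objective (to maximize): fraction of G/C nucleotides."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def optimize(structure, objective, steps=200, k=64, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    unpaired, pairs = parse_pairs(structure)
    theta_u = np.zeros((len(unpaired), 4))     # logits, start from uniform
    theta_p = np.zeros((len(pairs), 6))
    best_seq, best_val = None, -np.inf
    for _ in range(steps):
        p_u, p_p = softmax(theta_u), softmax(theta_p)
        iu, ip = sample_indices(p_u, k, rng), sample_indices(p_p, k, rng)
        seqs = [decode(structure, iu[s], ip[s]) for s in range(k)]
        f = np.array([objective(x) for x in seqs])
        s_best = int(f.argmax())               # final candidate comes from samples
        if f[s_best] > best_val:
            best_seq, best_val = seqs[s_best], float(f[s_best])
        # Score-function estimate: mean of f(x) * grad log p(x), where for a
        # softmax variable grad log p(choice) = onehot(choice) - probs.
        w = f[:, None, None]
        theta_u += lr * (w * (np.eye(4)[iu] - p_u)).mean(axis=0)
        theta_p += lr * (w * (np.eye(6)[ip] - p_p)).mean(axis=0)
    return best_seq, best_val, theta_u, theta_p
```

For example, `optimize("((...))", toy_gc_fraction)` returns the best sampled sequence, its objective value, and the learned logits. Because each base pair is a single coupled variable, a sequence whose paired positions cannot pair is never sampled.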
