Differentiable Folding for Nearest Neighbor Model Optimization
Ryan K. Krueger, Sharon Aviran, David H. Mathews, Jeffrey Zuber, Max Ward
TL;DR
The paper tackles the challenge of fitting the extensive Turner Nearest Neighbor (NN) thermodynamic parameters for RNA secondary structure by reframing parameter optimization as differentiable folding, which enables gradient-based fitting from both structural and thermodynamic data. It introduces a general, scalable framework implemented in JAX that differentiates through the partition function $Z_{q,\theta}$ to optimize the NN parameters $\theta$, with configurable base-parameter extrapolation and grammar variants ($d0$/$d2$). Training combines two data sources, thermodynamic optical melting data (RNAometer) and structural data (ArchiveII), through a joint loss $\mathcal{L}(\theta)=(1-\alpha)\mathcal{L}_{\text{struct}}(\theta)+\alpha\mathcal{L}_{\text{thermo}}(\theta)$, where $\alpha$ sets the relative weight of the thermodynamic term. The approach yields substantial improvements over baselines across multiple RNA families and unseen datasets, including increases of up to $1.3\times 10^{53}$-fold in the predicted probability of ground-truth sequence-structure pairs. The work paves the way for integrating new experimental data, refining thermodynamic interpretations, and embedding NN parameter fitting as a module in larger deep learning pipelines. The software and the RNAometer data are publicly available.
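The joint loss above can be illustrated with a minimal JAX sketch. This is not the paper's implementation: it assumes a toy ensemble of three explicitly enumerated structures instead of a dynamic-programming partition function, a linear energy model (energy = parameter counts times $\theta$), a negative log-probability structural loss, and a squared-error thermodynamic loss. All names (`COUNTS`, `TRUE_IDX`, `COUNTS_EXP`, `DG_MEASURED`, `ALPHA`) and all numeric values are hypothetical.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

KT = 0.6163  # approximate R*T in kcal/mol at 37 °C

# Hypothetical toy ensemble for one sequence q: each row counts how often
# each of 3 parameters is used by one candidate structure.
COUNTS = jnp.array([
    [0.0, 0.0, 0.0],  # open chain, reference energy 0
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])
TRUE_IDX = 1  # assumed index of the ground-truth structure

# Hypothetical "melting experiments": parameter counts and measured free
# energies (kcal/mol). These values are made up for illustration.
COUNTS_EXP = jnp.array([[2.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
DG_MEASURED = jnp.array([-2.1, -1.4])
ALPHA = 0.5  # assumed mixing weight between the two loss terms


def log_partition(theta):
    """log Z_{q,theta}, here by brute-force enumeration of the toy ensemble."""
    energies = COUNTS @ theta
    return logsumexp(-energies / KT)


def struct_loss(theta):
    """Negative log Boltzmann probability of the ground-truth structure."""
    energies = COUNTS @ theta
    return energies[TRUE_IDX] / KT + log_partition(theta)


def thermo_loss(theta, counts_exp, dg_measured):
    """Mean squared error between predicted and measured free energies."""
    return jnp.mean((counts_exp @ theta - dg_measured) ** 2)


def joint_loss(theta, alpha, counts_exp, dg_measured):
    """(1 - alpha) * L_struct + alpha * L_thermo."""
    return ((1.0 - alpha) * struct_loss(theta)
            + alpha * thermo_loss(theta, counts_exp, dg_measured))


# Gradient with respect to theta (the first argument), via autodiff.
grad_fn = jax.grad(joint_loss)

theta = jnp.array([-1.0, 0.5, -0.3])
for _ in range(50):  # plain gradient descent with a fixed step size
    theta = theta - 0.05 * grad_fn(theta, ALPHA, COUNTS_EXP, DG_MEASURED)
```

In this toy setting the gradient of the structural term reduces to the difference between the ground-truth parameter counts and their Boltzmann-weighted expectation, scaled by $1/kT$. A real implementation would replace `log_partition` with a differentiable folding recursion over the full structure space.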
Abstract
The Nearest Neighbor model is the $\textit{de facto}$ thermodynamic model of RNA secondary structure formation and is a cornerstone of RNA structure prediction and sequence design. The current functional form (Turner 2004) contains $\approx13,000$ underlying thermodynamic parameters, and fitting these to both experimental and structural data is computationally challenging. Here, we leverage recent advances in $\textit{differentiable folding}$, a method for directly computing gradients of the RNA folding algorithms, to devise an efficient, scalable, and flexible means of parameter optimization that uses known RNA structures and thermodynamic experiments. Our method yields a significantly improved parameter set that outperforms existing baselines on all metrics, including an increase in the average predicted probability of ground-truth sequence-structure pairs for a single RNA family by over 23 orders of magnitude. Our framework provides a path towards drastically improved RNA models, enabling the flexible incorporation of new experimental data, definition of novel loss terms, large training sets, and even treatment as a module in larger deep learning pipelines. We make available a new database, RNAometer, with experimentally-determined stabilities for small RNA model systems.
