Differentiable Folding for Nearest Neighbor Model Optimization

Ryan K. Krueger, Sharon Aviran, David H. Mathews, Jeffrey Zuber, Max Ward

TL;DR

The paper tackles the challenge of fitting the extensive Turner Nearest Neighbor (NN) thermodynamic parameters for RNA secondary structure by casting parameter optimization as a differentiable folding problem, which enables gradient-based fitting from both structural and thermodynamic data. It introduces a general, scalable framework implemented in JAX that differentiates through the partition function $Z_{q,\theta}$ to optimize NN parameters, with configurable base-parameter extrapolation and grammar variants ($d0$/$d2$). The framework draws on two data sources, thermodynamic optical melting data (RNAometer) and structural data (ArchiveII), and combines them in a joint loss $\mathcal{L}(\theta)=(1-\alpha)\mathcal{L}_{\text{struct}}(\theta)+\alpha\mathcal{L}_{\text{thermo}}(\theta)$, where $\alpha$ controls the relative importance of the two terms. The approach yields substantial improvements over baselines across multiple RNA families and unseen datasets, including dramatic gains in ground-truth sequence-structure probabilities (up to $1.3\times 10^{53}$). The work paves the way for integrating new experimental data, refining thermodynamic interpretations, and embedding NN parameter fitting as a module in larger deep learning pipelines, with software and RNAometer data made publicly available.
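The joint loss above can be illustrated with a minimal JAX sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the three-structure toy ensemble, the `FEATURES` count matrix, the squared-error form of the thermodynamic term, and the names `TRUE_IDX` and `MEASURED_DG` are all hypothetical. It shows only how a structural term (negative log-probability of the known structure) and a thermodynamic term (mismatch with a measured free energy) can be mixed by $\alpha$ and differentiated with respect to $\theta$.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

KT = 0.6163  # approx. RT in kcal/mol at 37 °C

# Hypothetical toy ensemble: 3 candidate structures for one sequence, each
# described by how many times it uses each of 2 energy parameters.
FEATURES = jnp.array([[0.0, 0.0],   # unfolded structure: no contributions
                      [2.0, 1.0],
                      [1.0, 2.0]])
TRUE_IDX = 1        # hypothetical index of the ground-truth structure
MEASURED_DG = -3.0  # hypothetical measured free energy (kcal/mol)

def energies(theta):
    # Free energy of each structure is linear in the parameters.
    return FEATURES @ theta

def struct_loss(theta, idx):
    # Negative Boltzmann log-probability of structure idx.
    e = energies(theta)
    log_z = logsumexp(-e / KT)
    return e[idx] / KT + log_z

def thermo_loss(theta, idx, measured_dg):
    # Squared error against a measured free energy (an assumed form).
    return (energies(theta)[idx] - measured_dg) ** 2

def joint_loss(theta, alpha=0.5):
    # alpha = 0 uses only structural data, alpha = 1 only thermodynamic data.
    return ((1.0 - alpha) * struct_loss(theta, TRUE_IDX)
            + alpha * thermo_loss(theta, TRUE_IDX, MEASURED_DG))

theta = jnp.array([-1.0, -0.5])
loss, grad = jax.value_and_grad(joint_loss)(theta)
theta_next = theta - 0.01 * grad  # one plain gradient-descent step
```

In a real setting the enumeration over three structures would be replaced by a folding recursion over all structures, and the losses would be averaged over many sequences.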

Abstract

The Nearest Neighbor model is the $\textit{de facto}$ thermodynamic model of RNA secondary structure formation and is a cornerstone of RNA structure prediction and sequence design. The current functional form (Turner 2004) contains $\approx13,000$ underlying thermodynamic parameters, and fitting these to both experimental and structural data is computationally challenging. Here, we leverage recent advances in $\textit{differentiable folding}$, a method for directly computing gradients of the RNA folding algorithms, to devise an efficient, scalable, and flexible means of parameter optimization that uses known RNA structures and thermodynamic experiments. Our method yields a significantly improved parameter set that outperforms existing baselines on all metrics, including an increase in the average predicted probability of ground-truth sequence-structure pairs for a single RNA family by over 23 orders of magnitude. Our framework provides a path towards drastically improved RNA models, enabling the flexible incorporation of new experimental data, definition of novel loss terms, large training sets, and even treatment as a module in larger deep learning pipelines. We make available a new database, RNAometer, with experimentally-determined stabilities for small RNA model systems.
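To make the idea of computing gradients through a folding algorithm concrete, the following is a minimal JAX sketch under strong simplifying assumptions. It is not the Turner 2004 model or the paper's code: it uses a McCaskill-style recursion over a toy energy model in which each base pair contributes one of three hypothetical parameters (`theta[0]` for AU, `theta[1]` for GC, `theta[2]` for GU) and loops cost nothing. Differentiating the ensemble free energy $-kT \log Z$ with respect to $\theta$ then gives the expected number of pairs of each type.

```python
import jax
import jax.numpy as jnp

KT = 0.6163    # approx. RT in kcal/mol at 37 °C
MIN_LOOP = 3   # minimum number of unpaired bases enclosed by a pair

# Hypothetical parameter layout: one energy per pair type.
PAIR_TYPE = {("A", "U"): 0, ("U", "A"): 0,
             ("G", "C"): 1, ("C", "G"): 1,
             ("G", "U"): 2, ("U", "G"): 2}

def log_partition(theta, seq):
    """log Z over all nested pairings of seq under the toy pair-only model."""
    n = len(seq)
    z = {}

    def get(i, j):
        # An empty interval has exactly one (empty) structure.
        return z[(i, j)] if i <= j else jnp.asarray(1.0)

    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            total = get(i + 1, j)  # case 1: base i is unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                t = PAIR_TYPE.get((seq[i], seq[k]))
                if t is not None:
                    weight = jnp.exp(-theta[t] / KT)
                    total = total + weight * get(i + 1, k - 1) * get(k + 1, j)
            z[(i, j)] = total
    return jnp.log(z[(0, n - 1)])

def ensemble_free_energy(theta, seq):
    return -KT * log_partition(theta, seq)

theta = jnp.array([-1.0, -2.0, -0.5])  # hypothetical pair energies (kcal/mol)
seq = "GGGAAAUCCC"
# Gradient of the ensemble free energy = expected pair counts per type.
expected_pairs = jax.grad(lambda t: ensemble_free_energy(t, seq))(theta)
```

The Python loops are unrolled during tracing, which is acceptable for a short sequence; a scalable implementation would need a vectorized or scanned formulation of the recursion.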

Paper Structure

This paper contains 10 sections, 6 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An overview of our method for NN model parameter optimization. A. NN model parameter fitting can be formulated as an optimization problem akin to training a neural network. The architecture is defined by the RNA folding grammar, the free parameters are (a subset of) the corresponding thermodynamic values, and the dataset comprises (i) optical melting experiments, which ascribe free energies to sequence-structure pairs, and (ii) structural data consisting of sequences and their most likely structures. B. Our method for parameter optimization via differentiable folding, in which a loss function is defined over thermodynamic quantities and its gradient is computed via differentiable folding for gradient descent.
  • Figure 2: Optimizing Nearest Neighbor parameters via gradient descent under our default settings: $\alpha = 0.5$; no terminal mismatches, dangling ends, or coaxial stacks (equivalent to d0 in ViennaRNA); and the extrapolation rules of Zuber et al. A. The change in the average log-probability for all sequences of length $n \leq 512$ for each family within the ArchiveII dataset. Since optimization is performed via stochastic gradient descent, points depict periodic evaluations of the entire dataset. 23S ribosomal RNAs are excluded from the training set. Dashed lines depict baseline values computed via ViennaRNA with the default Turner 2004 parameters. B. The change in normalized mean squared error (MSE) between the ground-truth free energy values from the thermodynamic dataset of optical melting experiments and the computed values. Dashed lines depict this value evaluated using ViennaRNA with the Turner 2004 parameters under d0.
  • Figure S1: Absolute changes in parameter values, grouped by parameter type, for the optimization depicted in Figure 2.
  • Figure S2: Flexibly changing the formulation of the optimization problem. A. The final unscaled structural and thermodynamic loss values for optimizations with the same parameters as in Figure 2 but with varying values of $\alpha$, which controls the relative importance of the two terms. B. The total loss over time for four variants of the optimization problem in which we (i) follow either the d0 or d2 convention, and (ii) apply either the minimal set of parameter extrapolations or the more stringent extrapolation rules of Zuber et al.
  • Figure S3: The log-probabilities for all telomerase sequence/structure pairs with $n \leq 512$ under the parameters optimized using each of Equation \ref{eqn:orig-struct} and Equation \ref{eqn:alt-struct} as the structural loss. Equation \ref{eqn:orig-struct} corresponds to maximizing the average log-probability per family (yellow) and Equation \ref{eqn:alt-struct} corresponds to maximizing the logarithm of the average probability per family (blue).
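
The two structural objectives contrasted in the Figure S3 caption, the average log-probability and the logarithm of the average probability, can behave very differently on the same data. The short JAX sketch below illustrates this on hypothetical per-sequence log-probabilities; the values and function names are assumptions for illustration only. By Jensen's inequality the log of the average is never smaller than the average of the logs, and it is dominated by the best-predicted sequences, whereas the average log-probability is pulled down by every poorly predicted one.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def avg_log_prob(log_probs):
    # Average log-probability: mean over sequences of log p.
    return jnp.mean(log_probs)

def log_avg_prob(log_probs):
    # Logarithm of the average probability, computed stably in log space:
    # log(mean(p)) = logsumexp(log p) - log(N).
    return logsumexp(log_probs) - jnp.log(log_probs.shape[0])

# Hypothetical log-probabilities of ground-truth structures in one family.
log_probs = jnp.array([-5.0, -40.0, -120.0])

mean_of_logs = avg_log_prob(log_probs)
log_of_mean = log_avg_prob(log_probs)
```

Working in log space matters here because probabilities of this size underflow when exponentiated directly in single precision.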