Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment

Sanzo Miyazawa

TL;DR

The paper addresses the inference of evolutionary fields and couplings from protein MSAs via the inverse Potts model, in which a sequence $\sigma$ has probability $P(\sigma) \propto \exp(-\psi_N(\sigma))$ with evolutionary energy $\psi_N(\sigma) = -\sum_i h_i(\sigma_i) - \sum_{i<j} J_{ij}(\sigma_i,\sigma_j)$. It shows that three ingredients together recover single-site frequencies and pairwise correlations accurately for the PF00595 and PF00153 MSAs: regularization parameters tuned so that the sample and ensemble averages of evolutionary energy agree, a sparse-coupling model (L2 for fields, group L1 for couplings), and a modified Adam update (ModAdam) whose step sizes are proportional to the gradient. The study systematically compares gradient methods and regularization models and finds that ModAdam with L2-GL1 gives the best balance between sparsity and accuracy, although energy distributions are harder to reproduce for natural proteins than for protein-like sequences. It further links inferred energies to folding energetics through a selective temperature $T_s$, with $\hat{T}_s$ around 0.389–0.400 kcal/mol, and discusses implications for protein evolution, energy landscapes, and potential improvements via restricted Boltzmann machines. Overall, the work advances principled hyperparameter tuning and learning dynamics for sparse, evolutionarily meaningful Potts models of protein families.
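As a minimal sketch of the energy function above, the evolutionary energy of a single aligned sequence could be evaluated as follows. The integer encoding of residues, the nested-list layout of the fields, and the dictionary of coupling blocks keyed by site pairs are illustrative assumptions, not the paper's data structures.

```python
from itertools import combinations


def evolutionary_energy(seq, h, J):
    """Return psi_N(sigma) = -sum_i h_i(sigma_i) - sum_{i<j} J_ij(sigma_i, sigma_j).

    seq: list of integer state indices, one per alignment column (assumed encoding).
    h:   h[i][a] is the field for state a at site i.
    J:   J[(i, j)][a][b] is the coupling for sites i < j; absent pairs count as zero.
    """
    field_term = sum(h[i][a] for i, a in enumerate(seq))
    coupling_term = 0.0
    for i, j in combinations(range(len(seq)), 2):
        if (i, j) in J:
            coupling_term += J[(i, j)][seq[i]][seq[j]]
    return -(field_term + coupling_term)


# Toy model with 3 sites and 2 states per site (values are made up).
h = [[0.5, -0.5], [0.0, 1.0], [0.2, 0.0]]
J = {(0, 1): [[1.0, 0.0], [0.0, 1.0]]}
energy = evolutionary_energy([0, 0, 1], h, J)
```

Storing only the nonzero coupling blocks mirrors the sparse-coupling setting: pairs whose block has been driven to zero simply do not contribute to the sum.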

Abstract

The inverse Potts problem of inferring a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies has recently attracted a great deal of attention in studies of protein structure and evolution. We study regularization and learning methods, and how to tune regularization parameters so that interactions are correctly inferred in Boltzmann machine learning. Using $L_2$ regularization for fields and group $L_1$ for couplings is shown to be very effective for sparse couplings in comparison with $L_2$ and $L_1$. Two regularization parameters are tuned to yield equal values for the sample and ensemble averages of evolutionary energy. Both averages change smoothly and converge, but their learning profiles differ greatly between learning methods. The Adam method is modified to make the stepsize proportional to the gradient for sparse couplings. By first inferring interactions from protein sequences and then from Monte Carlo samples, it is shown that the fields and couplings can be well recovered, but that recovering the pairwise correlations at the resolution of the total energy is harder for natural proteins than for protein-like sequences. The selective temperature for folding/structural constraints in protein evolution is also estimated.
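The L2-GL1 regularization named in the abstract can be illustrated with a short sketch. It assumes the standard group-lasso form, in which each site pair $(i,j)$ forms one group and is penalized by the Euclidean norm of its coupling block. The function name, the data layout, and the absence of any normalization factor are assumptions made for illustration, since the exact expression is not given in this excerpt.

```python
import math


def l2_gl1_penalty(h, J, lam1, lam2):
    """Sketch of an L2 (fields) + group L1 (couplings) penalty.

    h:    h[i][a] fields.
    J:    J[(i, j)] is a 2D block of couplings for one site pair (one group).
    lam1: regularization weight for fields (lambda_1 in the figure captions).
    lam2: regularization weight for couplings (lambda_2 in the figure captions).
    """
    # L2 term: sum of squared fields.
    l2_fields = sum(x * x for site in h for x in site)
    # Group L1 term: sum over pairs of the Euclidean norm of each coupling block.
    group_l1 = sum(
        math.sqrt(sum(x * x for row in block for x in row))
        for block in J.values()
    )
    return lam1 * l2_fields + lam2 * group_l1


# Toy values chosen so the norms are easy to check by hand.
h_toy = [[3.0, 4.0]]
J_toy = {(0, 1): [[3.0, 0.0], [0.0, 4.0]], (0, 2): [[0.0, 0.0], [0.0, 0.0]]}
penalty = l2_gl1_penalty(h_toy, J_toy, lam1=0.1, lam2=2.0)
```

Because the norm is not squared, the penalty has a kink at zero for each whole block, which tends to switch off entire residue pairs rather than individual amino acid pair entries. This is the usual motivation for preferring a group penalty when couplings are expected to be sparse at the level of site pairs.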

Paper Structure

This paper contains 29 sections, 33 equations, 29 figures, 5 tables.

Figures (29)

  • Figure 1: A schematic representation of how to tune the regularization parameters, $\lambda_1$ for fields and $\lambda_2$ for couplings. The average evolutionary energy per residue $\overline{\psi_N}/L$ of natural sequences and the ensemble average of evolutionary energy per residue $(\bar{\psi} - \delta\psi^2)/L$ in the Ising gauge are plotted with circle and plus markers, respectively, against one of the regularization parameters, $\lambda_1 N_{eff}$ with $\lambda_2=\lambda_1$ or $\lambda_2 N_{eff}$ with $\lambda_1=\lambda_{1,0}$; $\lambda_{1,0}$ and $\lambda_{2,0}$ are the optimum values of $\lambda_{1}$ and $\lambda_{2}$, respectively. The regularization model L2-GL1 and the modified Adam method are employed; see Tables \ref{tbl: PF00595_parameters} and \ref{tbl: PF00153_parameters} for the values of $\lambda_{1,0}$ and $\lambda_{2,0}$. The upper and lower rows correspond to PF00595 and PF00153, respectively. The left four figures are for the natural sequences of PF00595 and PF00153, and the right four figures are for the MCMC samples, MC1162 and MC1445, obtained in the Boltzmann machine learning for PF00595 and PF00153, respectively.
  • Figure 2: Learning processes by the ModAdam, Adam, NAG, and RPROP-LR gradient-descent methods for PF00595. The averages of the Kullback-Leibler divergences, $D^{2}_{\text{KL}}$ for pairwise marginal distributions and $D^{1}_{\text{KL}}$ for single-site marginal distributions, are plotted against the iteration number in the learning processes. $D^{2}_{\text{KL}}$ and $D^{1}_{\text{KL}}$ for Adam, NAG, and RPROP-LR are indicated by the upper and lower black lines in the left, middle, and right figures, respectively, and those for ModAdam are shown in pink in all figures for comparison. The L2-L2 regularization model is employed. The values of hyper-parameters, along with other details, are listed in Table \ref{tbl: PF00595_learning_methods}.
  • Figure 3: Comparisons of the inferred couplings $J_{ij}(a_k,a_l)$ in the Ising gauge between ModAdam and the other gradient-descent methods, Adam, NAG, and RPROP-LR, for PF00595. The abscissas correspond to the couplings inferred by ModAdam, and the ordinates correspond to those inferred by Adam, NAG, and RPROP-LR, in order from left to right. The regularization model L2-L2 is employed for all methods. The solid lines indicate equality between ordinate and abscissa. The values of hyper-parameters are listed in Table \ref{tbl: PF00595_parameters}. Points of $J_{ij}(a_k,a_l)$ that overlap at a resolution of 0.001 are removed.
  • Figure 4: Differences in the learning of coupling parameters, $J_{ij}(a_k,a_l)$, between the ModAdam and Adam gradient-descent methods for PF00595. All $J_{ij}(a_k,a_l)$ where $(a_k,a_l) = \text{argmax}_{a_k,a_l \neq \text{deletion}} | J_{ij}(a_k,a_l) |$ in the Ising gauge are plotted against the distance between the $i$th and $j$th residues. The ModAdam and Adam methods are employed for the left and right figures, respectively. The regularization model L2-L2 is employed for both methods. The learning processes by both methods are shown in Fig. \ref{fig: PF00595_learning_process_KL_by_ModAdam_and_Adam}. Note that stronger couplings tend to be inferred for closely located residue pairs by the ModAdam method than by the Adam method. The values of hyper-parameters are listed in Table \ref{tbl: PF00595_parameters}.
  • Figure 5: Profiles of the average evolutionary energies along the learning process in the L2-L2 model by each gradient-descent method for PF00595. The average evolutionary energy per residue $\overline{\psi_N}/L$ of natural sequences and the ensemble average of evolutionary energy per residue in the Gaussian approximation $(\bar{\psi} - \delta\psi^2)/L$ in the Ising gauge are plotted every 10 iterations against the iteration number in the learning by ModAdam, NAG, Adam, and RPROP-LR, in order from left to right; the sample and ensemble averages are indicated by the upper and lower lines, respectively. The L2-L2 regularization model is employed. For ModAdam in the leftmost figure, the profiles for the first run of 1220 iterations and for the second run, which is required to run for more than 2000 iterations, are plotted with dots and solid lines, respectively; they overlap indistinguishably.
  • ...and 24 more figures
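The averaged Kullback-Leibler divergences $D^{1}_{\text{KL}}$ and $D^{2}_{\text{KL}}$ tracked in Figure 2 can be illustrated with a small sketch. It assumes that the divergence is taken from the observed marginal to the model marginal and that the average is a plain mean over sites (or over site pairs, with each pairwise table flattened to one list); neither convention is stated in this excerpt, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i = 0 contribute zero; q_i is floored at eps to avoid log(0).
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_marginal_kl(observed, model):
    """Average D_KL over a list of marginals (one per site, or one per site pair)."""
    return sum(kl_divergence(p, q) for p, q in zip(observed, model)) / len(observed)


# Toy single-site marginals for 2 sites with 2 states each (values are made up).
observed = [[0.5, 0.5], [0.9, 0.1]]
model = [[0.25, 0.75], [0.9, 0.1]]
d1 = mean_marginal_kl(observed, model)
```

In a learning run, a quantity of this kind would be recomputed from fresh Monte Carlo samples at each iteration, so its decrease indicates that the model marginals are approaching the observed ones.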