Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment
Sanzo Miyazawa
TL;DR
The paper addresses the inverse Potts problem: inferring evolutionary fields and couplings from protein MSAs, with $P(\sigma) \propto \exp(-\psi_N(\sigma))$ and $\psi_N(\sigma) = -\sum_i h_i(\sigma_i) - \sum_{i<j} J_{ij}(\sigma_i,\sigma_j)$. It shows that tuning the regularization parameters so that the sample and ensemble averages of the evolutionary energy are equal, combined with a sparse-coupling model (L2 for fields, group L1 for couplings) and a modified Adam update whose step size is proportional to the gradient (ModAdam), accurately recovers the single-site frequencies and pairwise correlations of the PF00595 and PF00153 MSAs. The study systematically compares gradient methods and regularization models and finds that ModAdam with L2-GL1 gives the best balance between sparsity and accuracy, although energy distributions are harder to reproduce for natural proteins than for protein-like sequences. It further links the inferred energies to folding energetics through a selective temperature $T_s$, with $\hat{T_s}$ around 0.389–0.400 kcal/mol, and discusses implications for protein evolution, energy landscapes, and potential improvements via restricted Boltzmann machines. Overall, the work advances principled hyperparameter tuning and learning dynamics for sparse, evolutionarily meaningful Potts models of protein families.
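As a concrete reading of the energy function above, the following minimal sketch evaluates $\psi_N(\sigma)$ for one integer-encoded sequence. The function name `potts_energy` and the array layout (`h[i, a]` for fields, `J[i, j, a, b]` for couplings, only entries with `i < j` used) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Evaluate psi_N(sigma) = -sum_i h_i(sigma_i) - sum_{i<j} J_ij(sigma_i, sigma_j).

    seq : length-L sequence of integer states (hypothetical encoding, 0..q-1)
    h   : array of shape (L, q), fields h_i(a)            (assumed layout)
    J   : array of shape (L, L, q, q), couplings J_ij(a,b) (assumed layout)
    """
    seq = np.asarray(seq)
    L = seq.shape[0]
    sites = np.arange(L)
    # Field contribution: h_i(sigma_i) summed over all sites.
    field_term = h[sites, seq].sum()
    # pair[i, j] = J_ij(sigma_i, sigma_j) for every ordered pair (i, j).
    pair = J[sites[:, None], sites[None, :], seq[:, None], seq[None, :]]
    # Keep only i < j so each pair is counted once.
    coupling_term = np.triu(pair, k=1).sum()
    return -(field_term + coupling_term)
```

Under this sign convention, sequences with larger fields and couplings have lower $\psi_N$ and therefore higher probability $P(\sigma) \propto \exp(-\psi_N(\sigma))$.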
Abstract
The inverse Potts problem, which infers a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies, has recently attracted a great deal of attention in studies of protein structure and evolution. We study regularization and learning methods, and how to tune regularization parameters, to correctly infer interactions in Boltzmann machine learning. Using $L_2$ regularization for fields and group $L_1$ regularization for couplings is shown to be very effective for sparse couplings in comparison with $L_2$ and $L_1$ regularization. Two regularization parameters are tuned to yield equal values for the sample and ensemble averages of evolutionary energy. Both averages change smoothly and converge, but their learning profiles differ greatly between learning methods. The Adam method is modified to make the step size proportional to the gradient for sparse couplings. By first inferring interactions from protein sequences and then from Monte Carlo samples, it is shown that the fields and couplings can be well recovered, but that recovering the pairwise correlations at the resolution of a total energy is harder for natural proteins than for protein-like sequences. The selective temperature for folding/structural constraints in protein evolution is also estimated.
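To illustrate the regularization scheme named above, the sketch below computes one common form of a combined penalty: a squared $L_2$ norm on the fields plus a group $L_1$ term on the couplings, in which each site pair $(i, j)$ forms one group and contributes the Euclidean norm of its $q \times q$ coupling block. The function name `l2_gl1_penalty`, the parameter names `lam_h` and `lam_J`, and the array layout are assumptions for illustration; the paper's exact normalization may differ.

```python
import numpy as np

def l2_gl1_penalty(h, J, lam_h, lam_J):
    """L2 penalty on fields plus group-L1 penalty on couplings (illustrative form).

    h     : array of shape (L, q), fields h_i(a)            (assumed layout)
    J     : array of shape (L, L, q, q), couplings J_ij(a,b) (assumed layout)
    lam_h : hypothetical regularization weight for fields
    lam_J : hypothetical regularization weight for couplings
    """
    L = h.shape[0]
    # Squared L2 norm over all field parameters.
    l2_fields = lam_h * np.sum(h ** 2)
    # One group per site pair i < j: the Euclidean norm of the (q, q) block.
    iu, ju = np.triu_indices(L, k=1)
    group_norms = np.sqrt(np.sum(J[iu, ju] ** 2, axis=(1, 2)))
    # Summing unsquared group norms is what drives whole blocks to zero.
    gl1_couplings = lam_J * group_norms.sum()
    return l2_fields + gl1_couplings
```

Because the group norms are not squared, the penalty tends to zero out entire pair blocks at once rather than individual entries, which is the sense in which group $L_1$ favors sparse couplings.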
