
Sampling at intermediate temperatures is optimal for training large language models in protein structure prediction

L. Ghiringhelli, A. Zambon, G. Tiana

Abstract

We investigate the parameter space of transformer models trained on protein sequence data within a statistical-mechanics framework, sampling the loss landscape at varying temperatures by Langevin dynamics to characterize the low-loss manifold and to understand the mechanisms behind the superior performance of transformers in protein structure prediction. We find that, unlike in feedforward networks, the absence of a first-order-like transition in the loss of the transformer produces a range of intermediate temperatures with good learning properties. We show that the parameters of most layers are highly conserved at these temperatures if the dimension of the embedding is optimal, and we provide an operative way to find this dimension. Finally, we show that the attention matrix is more predictive of the contact maps of the protein at temperatures and embedding dimensions higher than those optimal for learning.
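The sampling procedure described in the abstract can be illustrated on a toy problem. The sketch below, a minimal assumption-laden illustration rather than the authors' actual pipeline, runs overdamped Langevin dynamics on a quadratic loss: each step follows the negative loss gradient plus Gaussian noise whose scale is set by the temperature, so the parameters equilibrate toward the Boltzmann distribution $\propto e^{-L(\theta)/T}$. The function name, step size, and toy loss are all illustrative choices.

```python
import numpy as np

def langevin_sample(grad_loss, theta0, temperature, step=1e-2, n_steps=5000, rng=None):
    """Sample parameters from exp(-L(theta)/T) via overdamped Langevin dynamics.

    Update rule (Euler-Maruyama discretization):
        theta <- theta - step * grad_loss(theta)
                 + sqrt(2 * step * temperature) * gaussian_noise
    """
    rng = np.random.default_rng(rng)
    theta = np.array(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad_loss(theta) + np.sqrt(2.0 * step * temperature) * noise
        samples.append(theta.copy())
    return np.array(samples)

# Toy loss L(theta) = |theta|^2 / 2, so grad L = theta; at equilibrium each
# coordinate is Gaussian with variance equal to the temperature T.
grad = lambda th: th
samples = langevin_sample(grad, np.zeros(10), temperature=0.5, rng=0)
burn_in = samples[500:]  # discard the initial relaxation toward equilibrium
```

For a network, `grad_loss` would be the gradient of the training loss over the whole parameter vector; averaging observables such as the loss itself over the equilibrated samples gives the temperature-dependent quantities (e.g. $\langle U \rangle_T$) discussed in the paper.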

Paper Structure

This paper contains 20 sections, 41 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: The structure of acyl-coenzyme A binding protein (pdb code: 2abd), where the contacts between amino acids are indicated with dashed red lines.
  • Figure 2: The scheme of the $L$-module transformer (upper scheme) and of the feedforward model (lower scheme) used in this work.
  • Figure 3: Right of the line, the average value of the potential energy $\langle U \rangle_T$ (a), of the specific heat $d\langle U\rangle_T/dT$ (b), of the validation error $\langle \epsilon_{\mathrm{val}} \rangle_T$ (c), and of the test error $\langle \epsilon_{\mathrm{test}} \rangle_T$ (d) as functions of the rescaled temperature $T/T_{1/2}$ for the models $\mathrm{TF}_1$ (red curves), $\mathrm{TF}_4$ (blue curves) and $\mathrm{FF}$ (green curves). Left of the line, the same for early-stopped solutions (ES) and over-trained solutions (OT) obtained with Adam. Circles denote numerical measurements, while solid lines represent spline interpolations. All points have an error bar which indicates the standard deviation.
  • Figure 4: The distributions $\rho$ of the similarity $q^l$ of parameters for some of the layers, obtained by comparing equilibrated configurations at temperature $T^{\mathrm{best}}$ for each architecture (identified by different colors). The complete set is displayed in Fig. \ref{app_fig:q_tot} in the Appendix. The architectures labeled as $\mathrm{TF}_1^{26}$ and $\mathrm{TF}_4^{14}$ correspond to transformers of reduced dimensionality (see text).
  • Figure 5: The distribution $\rho$ of the values of $\boldsymbol{\gamma}_{\mathrm{weight}}^{(1)}$ (a) and $\boldsymbol{\gamma}_{\mathrm{weight}}^{(2)}$ (b) obtained from the sampling at $T^{\mathrm{best}}$ of the original and reduced transformers (solid lines). The dotted red lines indicate the distribution of $\boldsymbol{\gamma}_{\mathrm{weight}}^{(1/2)}$ values obtained from the ES minimization. The values reported for the four-module transformers refer to the last module.
  • ...and 9 more figures