Interpretable machine learning of amino acid patterns in proteins: a statistical ensemble approach

Anna Braghetto, Enzo Orlandini, Marco Baiesi

TL;DR

This work tackles the interpretability of unsupervised ML applied to protein sequence patterns by using an ensemble of restricted Boltzmann machines (RBMs) with an information bottleneck. The authors show that the ensemble compresses the essential correlations in sequences at the starts and ends of $\alpha$-helices and $\beta$-sheets into $2$–$3$ bits, while revealing nuanced amino-acid motifs and similarities. Key findings include Proline’s prominent role at helix starts, poly-Alanine motifs at helix ends, and a refined view of amino-acid groupings (e.g., D/E vs V/L/I/F) captured by PCA on RBM weights, which the authors interpret as an effective hydrophobicity axis. The study demonstrates that an interpretable, ensemble-based RBM approach can recover known amphiphilic patterns and uncover new motifs, offering mechanistic insights into secondary-structure formation with potential for guiding protein design and analysis.

Abstract

Explainable and interpretable unsupervised machine learning helps understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines compress consistently into a few bits the information stored in a sequence of five amino acids at the start or end of $\alpha$-helices or $\beta$-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and the secondary structure of proteins: (i) His and Thr have a negligible contribution to the amphiphilic pattern of $\alpha$-helices; (ii) there is a class of $\alpha$-helices particularly rich in Ala at their end; (iii) Pro occupies most often slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side, and Val, Leu, Iso, and Phe on the other, display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an "effective hydrophobicity", though they are not the most powerful (non) hydrophobic amino acids.

Paper Structure (23 sections, 5 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Sketch of an RBM with $N_v=12$ visible units (black circles, where data are given as an input) and $N_h=3$ hidden units (white circles). Red and blue shades indicate positive and negative values of single weights (plotted as lines joining units in the two layers) and biases (boxes next to units). We will follow a similar color scheme in Figures 4-7 below.
  • Figure 2: For (a) CD-1 and (b) PCD-10, we show the number and relative size of groups emerging from clustering hidden units in the ensemble of RBMs for every position of the secondary structure that we study. In both cases, we conclude the following: for $\alpha$-helices, $N_h=3$ is the optimal number of hidden units, while for $\beta$-sheets, it is $N_h=2$. These are the maximum values where the number of groups matches the number of hidden units, and the noise is still tiny, i.e., where each RBM in the ensemble has learned the same set of hidden units.
  • Figure 3: Pseudo log-likelihood as a function of the number of hidden units for RBMs trained with (a) CD-1 and (b) with PCD-10, shown for each of the four segments of secondary structure that we study. The PLL for the train and the validation set are compatible, showing that the RBMs have achieved robust training.
  • Figure 4: For the start of $\alpha$-helices, with ${N_h}=3$ hidden units: (a) Matrix with gray shade indicating the distance $d_{jm}$ between the weights $w_{ij}$ and $w_{im}$ of different hidden units $j,m$; the color boxes highlight the groups found by the DBSCAN clustering. (b) Average biases $a_i$ learned by the ensemble of RBMs, reshaped from an array with $20\Gamma=100$ entries to a $20\times \Gamma$ table, in which each column corresponds to a given encoding $\mathbf{v}_\gamma$ and each row to a given amino acid (a similar scheme is used in panels (c), (d), (e)). Values more negative than the lower threshold in the scale are marked with black squares (in this case for Cys and Pro, which essentially leads to the negligible probability of finding these amino acids in those positions). (c), (d), and (e): Average weights of units in groups 1, 2, and 3. (f) The shade of each slot in each panel shows the probability of picking a specific amino acid at a given position. Hence columns are normalized to $1$. The first $2^{N_h}=8$ panels show the probabilities for every hidden state (the sequence of $\pm 1$'s in the parenthesis at the bottom, where it follows the value of its empirical frequency). Hidden states are labeled and ranked with decreasing frequency, e.g., $\alpha$-S1 is the most probable hidden state at the start of $\alpha$-helices. The last two panels show the average of RBM $\alpha$-S states weighted according to their frequency, and the actual probability of amino acids at the $\Gamma=5$ initial positions of $\alpha$-helices. In practice, the prescription of the RBM for reconstructing meaningful sequences would be to (i) pick a hidden state at random according to its frequency and (ii) according to probabilities in its table, for every position $\gamma \le \Gamma$ pick an amino acid at random. The values of the hidden bias in the aRBM for each group are $b_1=-1.129$, $b_2=1.270$, $b_3=-1.496$.
  • Figure 5: For the end of $\alpha$-helices with 3 hidden units, the same scheme as in Fig. 4. Hidden bias: $b_1=0.541$, $b_2=-0.410$, $b_3=-0.191$.
  • ...and 4 more figures
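
The two-step sampling prescription described in the caption of Figure 4 (pick a hidden state according to its frequency, then pick one amino acid per position from that state's probability table) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_segment`, the helper `normalize`, the amino-acid ordering, and the data layout (one list of 20 probabilities per position) are assumptions made for illustration, and the frequencies and tables are random placeholders rather than values from the paper.

```python
import random

# Assumed ordering of the 20 standard amino acids (one-letter codes);
# the paper's own row ordering is not given in this summary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAMMA = 5  # number of positions in a segment, as in the source


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sample_segment(state_freqs, state_tables, rng):
    """Draw one segment following the two-step prescription.

    state_freqs:  one frequency per hidden state (sums to 1).
    state_tables: for each hidden state, GAMMA columns of 20 probabilities.
    """
    # Step (i): pick a hidden state at random according to its frequency.
    k = rng.choices(range(len(state_freqs)), weights=state_freqs)[0]
    # Step (ii): for every position, pick an amino acid from that state's column.
    columns = state_tables[k]
    segment = "".join(rng.choices(AMINO_ACIDS, weights=col)[0] for col in columns)
    return k, segment


rng = random.Random(0)
n_states = 2 ** 3  # N_h = 3 hidden units with values +/-1 give 8 hidden states

# Placeholder inputs: random numbers, NOT the frequencies or tables of the paper.
state_freqs = normalize([rng.random() for _ in range(n_states)])
state_tables = [
    [normalize([rng.random() for _ in range(len(AMINO_ACIDS))]) for _ in range(GAMMA)]
    for _ in range(n_states)
]

k, segment = sample_segment(state_freqs, state_tables, rng)
print(k, segment)
```

With $N_h=3$ there are only $2^{N_h}=8$ hidden states, so the index `k` carries at most 3 bits, which is the sense in which the hidden layer acts as an information bottleneck.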