Recent advances in interpretable machine learning using structure-based protein representations

Luiz Felipe Vecchietti; Minji Lee; Begench Hangeldiyev; Hyunkyu Jung; Hahnbeom Park; Tae-Kyun Kim; Meeyoung Cha; Ho Min Kim

TL;DR

This paper surveys how structure-based protein representations enable interpretable ML across three core tasks: structure prediction, functionality prediction, and protein–protein interactions. It emphasizes interpretable signals such as pLDDT and pAE in AlphaFold2-like models, and covers methods that provide explanations via per-residue or per-edge patterns, including GradCAM on graph embeddings and decision-tree paths. The discussion highlights surface- and graph-based representations (MaSIF, dMaSIF) and their interpretability benefits, while cautioning about the limitations of post-hoc explanations and the need for inherently interpretable architectures. Practically, the work argues that improved visualization and interpretable metrics will accelerate protein design, drug discovery, and knowledge discovery in structural biology, and will guide future methodological and visualization developments.

Abstract

Recent advancements in machine learning (ML) are transforming the field of structural biology. For example, AlphaFold, a groundbreaking neural network for protein structure prediction, has been widely adopted by researchers. Easy-to-use interfaces and interpretable outputs from the neural network architecture, such as the confidence scores used to color the predicted structures, have made AlphaFold accessible even to non-ML experts. In this paper, we present various methods for representing protein 3D structures, from low to high resolution, and show how interpretable ML methods can support tasks such as predicting protein structures, protein function, and protein-protein interactions. This survey also emphasizes the importance of interpreting and visualizing ML-based inference on structure-based protein representations in ways that enhance interpretability and knowledge discovery. Developing such interpretable approaches promises to further accelerate fields including drug development and protein design.

Paper Structure (12 sections, 7 figures, 1 table)

Introduction
Structure-based protein representations
Representations in Structural Biology
Representations in Computational Biology
Interpretable Machine Learning for Protein Structural Biology
Protein structure prediction
Protein functionality prediction
Predictions based on Paths in Decision Trees
Importance per-residue in Graph Convolutional Neural Networks predictions
Understanding protein-protein interactions
Discussion
Conclusion

Figures (7)

Figure 1: Schematic of the levels of protein structure and of the representations available to structural biologists in PyMOL [pymol]. All panels use the crystal structure of the human foetal deoxyhaemoglobin protein (PDB: 1FDH). (a) The primary protein structure consists of the sequence of amino acids in the polypeptide chain. The secondary protein structure consists of alpha helices and beta sheets formed by hydrogen bonding between atoms in the polypeptide backbone. The tertiary protein structure consists of the overall 3-dimensional structure of the folded protein chain. The quaternary protein structure consists of the structure formed by multiple interacting amino acid chains; (b) ribbon representation using lines; (c) cartoon representation; (d) all-atom representation using sticks; and (e) surface representation.
Figure 2: Examples of structure-based protein representations used in machine learning, shown for the green fluorescent protein (PDB: 1EMA). (a) Protein structure represented by a point cloud showing backbone atom coordinate positions; (b) Protein structure represented by a distance matrix; (c) Protein structure represented as a graph with residue $C_\alpha$ atom positions represented as nodes and edges defined by neighbors in the amino acid sequence; (d) Mesh representation of the protein surface. Subfigure (a) is plotted with Molecular Nodes [molecularnodes]. Subfigures (b) and (c) are plotted with Graphein [graphein]. Subfigure (d) is plotted with PyMOL [pymol].
Figure 3: Structure predicted by ColabFold [mirdita2022] for a tandem HMG box domain from the HMGB1 protein (PDB: 2YRQ). (a) Structure colored by chain; (b) 2-dimensional plot of the pLDDT metric by residue index; and (c) structure colored by pLDDT score. Blue residues are amino acids for which AF2 has high confidence in the prediction. Red residues are amino acids for which AF2 has low confidence in the prediction.
Figure 4: Graph-based visualization of the confidence metrics of AF2-Multimer [evans2021], run through ColabFold [mirdita2022], for a predicted protein complex (PDB: 7XKY, Fumarate hydratase apo-protein complex). (a) Structure predicted by AF2, visualized using PyMOL [pymol]; (b) Structure predicted by AF2, visualized as a graph using Graphein [graphein]. Nodes represent residues in 3D space. Edges represent the $k$-nearest neighbors of a residue, with $k$ set to 3. Nodes are colored by pLDDT, where a lighter color means a higher predicted error on the atom position; (c) Zoomed-in view of the internal structure of the plot in (b), where edges shown in a lighter color are predicted to have higher pAE.
Figure 5: Visualization of the nodes leading to the largest increase and the largest decrease of the log-fluorescence value in a GBDT-based predictor. The top part of the image shows 3 nodes of the subtree that lead to the largest increase in log-fluorescence value. The bottom part of the image shows 3 nodes of the subtree that lead to the largest decrease in log-fluorescence value. Each node shows the decision criterion and the positions of the amino acid pair. The amino acid pair related to the input feature being analyzed is highlighted in red. The visualization of the interaction is created using PyMOL [delano2002pymol] with the cartoon representation of the wild-type protein (PDB: 1EMA). The terms $\text{M}_{(i,j)}$ and $\text{WT}_{(i, j)}$ represent the distogram value for the interaction between the $i$-th and $j$-th residues in the protein sequence for the mutant structure and the wild-type structure, respectively.
...and 2 more figures
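
The distance-matrix and graph representations described in the captions of Figures 2 and 4 can be illustrated with a minimal NumPy sketch. This is not the code used to produce the figures (those were plotted with Graphein); the function names `distance_matrix`, `sequence_edges`, and `knn_edges` are hypothetical, and the toy coordinates are made up for illustration, roughly spaced like consecutive $C_\alpha$ atoms rather than taken from a PDB entry.

```python
import numpy as np


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue positions, shape (n, n)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sequence_edges(n):
    """Edges between neighbors in the amino acid sequence (as in Figure 2c)."""
    return [(i, i + 1) for i in range(n - 1)]


def knn_edges(dist, k=3):
    """Directed edges from each residue to its k nearest residues in 3D
    (k = 3 mirrors the setting quoted for Figure 4b)."""
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)  # a residue is not its own neighbor
    nearest = np.argsort(masked, axis=1)[:, :k]
    return {(i, int(j)) for i in range(dist.shape[0]) for j in nearest[i]}


# Hypothetical toy coordinates (in angstroms), NOT real protein data.
toy_ca = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [7.6, 3.8, 0.0],
    [3.8, 3.8, 0.0],
])

dist = distance_matrix(toy_ca)
seq = sequence_edges(len(toy_ca))
knn = knn_edges(dist, k=3)
```

In a real workflow the coordinates would come from a parsed structure file, and per-node or per-edge confidence values such as pLDDT or pAE would be attached to the graph as attributes for coloring.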
