Quantifying the Role of OpenFold Components in Protein Structure Prediction

Tyler L. Hayes, Giri P. Krishnan

TL;DR

The paper investigates which architectural components of Evoformer-based protein structure predictors contribute most to accuracy. By systematically ablating OpenFold components (attentional and non-attentional blocks) and measuring the change in TM-score across 154 proteins from a CAMEO subset, the authors identify MSA Column Attention, both MLP Transition layers, and the final Pair representation as broadly critical, with substantial reliance on evolutionary information. They further show that several components exhibit length-dependent importance: longer proteins rely more on MSA-based features, while shorter proteins depend more on triangle-based updates, which highlights heterogeneity across proteins. These results advance the interpretability of AlphaFold-like models and suggest directions for targeted architectural improvements and for the analysis of structure prediction networks.

Abstract

Models such as AlphaFold2 and OpenFold have transformed protein structure prediction, yet their inner workings remain poorly understood. We present a methodology to systematically evaluate the contribution of individual OpenFold components to structure prediction accuracy. We identify several components that are critical for most proteins, while others vary in importance across proteins. We further show that the contribution of several components is correlated with protein length. These findings provide insight into how OpenFold achieves accurate predictions and highlight directions for interpreting protein prediction networks more broadly.
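The analysis summarized above (a per-protein change in TM-score for each component study, then a rank correlation of that change with protein length) can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names, the sign convention (baseline minus ablated), and the toy numbers are all assumptions, and the p-value reported alongside the Spearman coefficient in the paper is omitted here.

```python
from statistics import mean


def tm_delta(baseline, ablated):
    """Per-protein TM-score change, assumed here to be baseline minus ablated."""
    return {pid: baseline[pid] - ablated[pid] for pid in baseline if pid in ablated}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative values, not data from the paper.
baseline = {"p1": 0.90, "p2": 0.85, "p3": 0.80, "p4": 0.95}
ablated = {"p1": 0.80, "p2": 0.65, "p3": 0.50, "p4": 0.55}
lengths = {"p1": 80, "p2": 150, "p3": 300, "p4": 420}

delta = tm_delta(baseline, ablated)
ids = sorted(delta)
rho = spearman([lengths[p] for p in ids], [delta[p] for p in ids])
```

Under this convention, a positive `rho` for a given component study would mean that ablating the component hurts longer proteins more, which is the pattern the summary describes for MSA-based features.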

Paper Structure

This paper contains 17 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: One Evoformer Block in OpenFold. Each block operates on the MSA and Pair representations via a series of attention, transition, and update operations. For clarity, residual connections are omitted, except for the Outer Product Mean connection from the MSA to the Pair representation.
  • Figure 2: Differences ($\Delta$) Between Baseline TM-Scores and Component Studies Across Proteins. Higher scores indicate more deviation from the Baseline. Studies involving MSA and Pair Representations are in blue and orange, respectively.
  • Figure 3: Comparison of Baseline Differences in TM-Scores ($\Delta$) vs. Protein Length. Higher scores indicate more deviation from the Baseline. We show the best-fit line, the Spearman correlation coefficient, and its associated $p$-value. Each point represents the results for one protein.
  • Figure S1: Raw TM-Scores for Component Studies. Higher scores indicate better performance. Studies with MSA and Pair Representations are in blue and orange, respectively.
  • Figure S2: Differences ($\Delta$) Between Baseline TM-Scores and Component Studies Across Proteins. Higher scores indicate more deviation from the Baseline. Studies involving MSA and Pair Representations are in blue and orange, respectively.
  • ...and 2 more figures