Towards Interpretable Protein Structure Prediction with Sparse Autoencoders

Nithin Parsan, David J. Yang, John J. Yang

TL;DR

This work addresses the interpretability gap in protein structure prediction by scaling sparse autoencoders (SAEs) to ESM2-3B, the language model underlying ESMFold, and adapting Matryoshka SAEs to learn hierarchically organized sparse features. It shows that SAE reconstructions preserve language modeling and structure-prediction performance while enabling biologically meaningful concept discovery and contact map prediction. A feature-steering case study shows that manipulating SAE features can steer ESMFold toward higher solvent accessibility for a fixed input sequence without compromising overall structure. The authors release open-source code, datasets, models, and a visualizer so the community can explore these representations.

Abstract

Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, and pretrained models (https://github.com/johnyang101/reticular-sae) and visualizer (https://sae.reticular.ai).
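
The nested-reconstruction idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `matryoshka_loss`, the ReLU encoder, the squared-error objective, the prefix sizes, and all dimensions are assumptions chosen for illustration. The one point it aims to show is that each nested prefix of the latent vector is decoded on its own and contributes its own reconstruction term.

```python
import numpy as np

def matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, prefix_sizes):
    """Sum of reconstruction errors over nested prefixes of the latents.

    Hypothetical sketch: a ReLU encoder and squared error are assumed here;
    the paper's activation and loss weighting may differ.
    """
    # Encode the (centered) inputs into non-negative latents.
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    losses = []
    for m in prefix_sizes:
        # Decode from the first m latents only, so every nested group
        # has to reconstruct the input independently of later latents.
        x_hat = z[:, :m] @ W_dec[:m] + b_dec
        losses.append(float(np.mean(np.sum((x - x_hat) ** 2, axis=1))))
    return sum(losses), losses

# Toy dimensions (assumed); real hidden states would come from ESM2-3B.
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
W_dec = rng.normal(scale=0.1, size=(n_latents, d_model))
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

total, per_prefix = matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, [8, 16, 32, 64])
```

Because the smallest prefix must reconstruct the input by itself, early latents are pushed toward general features and later latents toward finer detail, which is one way to read the "hierarchically organized" description above.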

Paper Structure

This paper contains 34 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: a) Matryoshka Sparse Autoencoder (SAE) architecture for training on ESM2 hidden layer representations, showing nested sparse feature organization. b) SAE intervention framework for ESMFold, comparing normal operation (left) where all ESM2 hidden representations flow to the structure trunk, versus intervention (right) where only a modified layer 36 representation is used while ablating all other layers.
  • Figure 2: Analysis of feature-concept relationships and long-range contact accuracy.
  • Figure 3: Feature steering and SASA analysis.
  • Figure 4: Distribution of highest F1 scores achieved for each concept across models.
  • Figure 5: Left: Number of concepts where the row model achieves a higher maximum F1 score for a given concept than the column model. Right: Average score difference of the highest F1 scores for a given concept between models (row minus column) across all concepts. Each cell compares the best-performing feature for each concept between model pairs. Darker blue indicates stronger performance advantage for the row model.
  • ...and 4 more figures
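
The pairwise comparison described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' analysis code: the function name `pairwise_comparison`, the model labels, and the F1 values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def pairwise_comparison(best_f1):
    """Compare models by their best per-concept F1 scores.

    best_f1 maps a model name to an array holding, for each concept,
    the highest F1 achieved by any feature (same concept order for all).
    """
    names = list(best_f1)
    scores = np.stack([np.asarray(best_f1[n], dtype=float) for n in names])
    # diff[i, j, c] = score of row model i minus column model j on concept c.
    diff = scores[:, None, :] - scores[None, :, :]
    wins = (diff > 0).sum(axis=2)    # left panel: concepts where row beats column
    mean_diff = diff.mean(axis=2)    # right panel: average difference, row minus column
    return names, wins, mean_diff

# Hypothetical toy scores for two models over three concepts.
names, wins, mean_diff = pairwise_comparison({
    "small": [0.25, 0.50, 0.50],
    "large": [0.50, 0.75, 0.50],
})
```

In this toy input, the entry `wins[1, 0]` counts the concepts on which "large" beats "small", and `mean_diff[1, 0]` is the corresponding average margin; ties count for neither model.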