InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

Elana Simon, James Zou

TL;DR

The paper addresses the opacity of protein language models by introducing InterPLM, a framework that uses sparse autoencoders to extract interpretable latent features from ESM-2 embeddings. It shows that thousands of latent features per layer align with known Swiss-Prot concepts and that LLMs can describe these features, with an interactive dashboard to explore them. Compared to neuron-level analyses, SAE features yield far more concept-aligned interpretable representations and enable practical applications such as missing-annotation discovery and steering of sequence generation. The work provides a scalable, end-to-end pipeline for interpreting PLMs, annotating novel features, and guiding protein design, with broad implications for biological discovery and model development.

Abstract

Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. Here we present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify up to 2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts such as binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals up to 46 neurons per layer with clear conceptual alignment across 15 known concepts, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs. As community resources, we introduce InterPLM (interPLM.ai), an interactive visualization platform for exploring and analyzing learned PLM features, and release code for training and analysis at github.com/ElanaPearl/interPLM.
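
The sparse autoencoder idea in the abstract can be illustrated with a minimal sketch. This is not the released InterPLM code: the class name `SparseAutoencoder`, the dimensions, the initialization scale, and the `l1_coeff` value are all assumptions chosen for illustration, and the common ReLU encoder with an L1 sparsity penalty is assumed rather than taken from the paper. It shows only the forward pass (embedding to sparse features to reconstructed embedding) and the loss, written in plain Python so it runs without extra dependencies.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Hypothetical minimal SAE: ReLU encoder, linear decoder, L1 penalty."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        # Encoder maps a d_model embedding to d_dict latent features.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        # Decoder maps the latent features back to embedding space.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, apply the encoder, then ReLU so that
        # every feature activation is non-negative.
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = [a + b for a, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        # Reconstruct the embedding as a weighted sum of dictionary vectors.
        return [a + b for a, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Mean squared reconstruction error plus an L1 sparsity penalty.
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(features)  # activations are non-negative after ReLU
        return mse + l1_coeff * l1
```

In practice the input `x` would be a per-residue embedding taken from one ESM-2 layer, and the weights would be fitted by gradient descent on this loss; the training loop and the reinsertion of reconstructed embeddings into the PLM are omitted here.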

Paper Structure

This paper contains 52 sections, 3 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview of SAE methodology and representative SAE features revealed through automated activation pattern analysis. a) Pipeline illustrating the extraction of embeddings, their conversion to features, and subsequent reinsertion of reconstructed embeddings into the PLM. b-c) Examples of features exhibiting interpretable activation patterns, both structural and conceptual. Each feature is visualized using a protein where maximal activation occurs. Feature activation intensities are displayed both along the protein sequence (amino acid height indicates activation magnitude) and protein structure (darker pink indicates stronger activation). b) Features selected to demonstrate activation patterns in structurally proximate amino acids, sequentially proximate amino acids, or both. c) Features selected based on significant associations between activation patterns and known Swiss-Prot biological concept annotations. Feature identifiers and corresponding layers (top to bottom): Left panel (f/3147, f/10091, f/67, layer 4); Right panel (f/7125, f/8128, f/1455 from layers 4, 5, 5).
  • Figure 2: SAE feature analysis and visualizations reveal features with diverse and consistent activation patterns. a) Quantitative comparison of learned features through four complementary approaches: 1) Feature activation frequency distribution showing the relationship between proteome-wide prevalence (x-axis) and protein-specific activation strength (y-axis), revealing both ubiquitous and selective features 2) Structural vs. sequential activation patterns, comparing feature activation strengths in 3D versus sequence proximity to peak activation sites, revealing features that operate through either structural or sequential mechanisms 3) UMAP embedding of feature vectors, illustrating natural clustering of related structural/functional motifs 4) Swiss-Prot concept mapping results, linking learned features to known biological concepts. b) Structural visualization of four representative features (left to right: f/1854, f/10230, f/8144, f/8128) mapped onto example proteins, with each feature highlighted in the figure above it from (a) in blue.
  • Figure 3: SAE features have stronger associations with Swiss-Prot concepts than ESM neurons. Comparing the features of an SAE model trained on ESM-2 embeddings (pink), the original neurons of the ESM-2 embeddings (blue), and the features of an SAE trained on embeddings from an ESM-2 model with shuffled weights (green). Models are compared based on the F1 scores between features and Swiss-Prot concepts. (a) For each concept, select one feature from the validation set (based on highest F1 with that concept), and visualize the F1 for that feature and concept in a held-out set. (b) For each layer, count the number of features that have an F1 score with any concept > 0.5 in both the validation and held-out sets.
  • Figure 4: Clustering reveals groups of features with similar functional and structural roles but subtle differences in activation patterns. (a) UMAP of SAE features clustered based on their dictionary values. Features associated with one of the top 20 most commonly labeled Swiss-Prot concepts are highlighted in color, and several random clusters (Swiss-Prot labeled or not) are manually identified. In particular, one cluster of kinase features is expanded. (b) Structures of the maximum activating examples from 3 features selected from a cluster of kinase-binding-site features. Higher activation values are indicated with darker pink, and the location of the catalytic loop is specified. (c) Comparing the maximum activation value of each feature on the maximally activating proteins selected by the other kinase features. (d) Comparing the physical locations within the kinase binding site where these features activate, with specific annotations for each sub-region within the kinase binding site that has feature activation. All features visualized on protein kinase A (UniProt ID: P17612). (e) Structures of maximum activating examples from 3 features selected from a cluster of TBDR beta barrels. (f) Activation patterns of 3 features in a cluster unlabeled by Swiss-Prot concepts but identified as a glycosyltransferase cluster. All features visualized on glycosyltransferase amsK (UniProt ID: Q46638), the maximally activating protein for f/9047.
  • Figure 5: Language models can generate automatic feature descriptions for SAE features. (a) Workflow for generating and validating descriptions with Claude-3.5 Sonnet (new). (b) Comparing measured maximum activation values in proteins to predicted maximum activation values via Pearson r correlation across 1200 features. (c) Examples of generated feature descriptions and maximally activated proteins of each feature. Predicted activations quality visualized via kernel density estimation. The text is Claude's description summary of each feature. Elements of description present in max examples annotated next to structures.
  • ...and 8 more figures
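
The feature-to-concept evaluation described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: it uses a plain per-residue F1, and the function names, the dictionary layout, and the activation threshold of 0.5 are assumptions. It also assumes that a feature counts as concept-aligned only when the same concept exceeds the F1 cutoff in both the validation and held-out sets.

```python
def f1(pred, truth):
    """Per-residue F1 between a binarized feature and a concept annotation."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def binarize(activations, threshold):
    """Mark a residue as active when its activation exceeds the threshold."""
    return [a > threshold for a in activations]


def select_and_evaluate(val_acts, val_labels, test_acts, test_labels,
                        threshold=0.5):
    """For each concept, pick the best validation feature, then score it
    on the held-out set (panel a of the caption)."""
    results = {}
    for concept, truth in val_labels.items():
        best = max(
            val_acts,
            key=lambda fid: f1(binarize(val_acts[fid], threshold), truth),
        )
        held_out = f1(binarize(test_acts[best], threshold),
                      test_labels[concept])
        results[concept] = (best, held_out)
    return results


def count_aligned_features(val_acts, val_labels, test_acts, test_labels,
                           threshold=0.5, min_f1=0.5):
    """Count features whose F1 with some concept exceeds min_f1 in both
    the validation and held-out sets (panel b of the caption)."""
    count = 0
    for fid in val_acts:
        aligned = any(
            f1(binarize(val_acts[fid], threshold), val_labels[c]) > min_f1
            and f1(binarize(test_acts[fid], threshold), test_labels[c]) > min_f1
            for c in val_labels
        )
        if aligned:
            count += 1
    return count
```

Here each `*_acts` dictionary maps a feature identifier (for example `"f/1"`) to a list of per-residue activations, and each `*_labels` dictionary maps a concept name to a list of per-residue booleans of the same length. Selecting the feature on one split and scoring it on another keeps the reported F1 from being inflated by the selection step.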