Interpreting and Steering Protein Language Models through Sparse Autoencoders

Edith Natalia Villegas Garcia, Alessio Ansuini

TL;DR

This work shows that sparse autoencoders (SAEs) can disentangle the internal representations of a protein language model, enabling mechanistic interpretability by linking latent components to biological annotations. The authors select an informative layer using an intrinsic dimension plateau analysis, interpret latent features through UniProt annotations, and identify latent directions associated with features such as zinc finger motifs. They further show that targeted interventions on these latents can steer sequence generation: 24 of 180 trials produced sequences matching zinc finger motifs, and the generated sequences were diverse and folded plausibly under ESMFold. The approach offers a route toward more transparent and steerable biological sequence models; code and data are made available.

Abstract

The rapid advancements in transformer-based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAE) to interpret the internal representations of protein language models, specifically focusing on the ESM-2 8M parameter model. By performing a statistical analysis on each latent component's relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs. We then leverage these insights to guide sequence generation, shortlisting the relevant latent components that can steer the model towards desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
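
The statistical analysis mentioned above scores each latent component against each annotation; the Figure 2 caption reports these scores as precision, recall and F1 at an activation threshold $\tau_k$. The following is a minimal sketch of one plausible way to compute such scores, not the authors' implementation. It assumes per-residue activations of a single latent and a per-residue boolean annotation mask; the function name `latent_detection_scores` and the toy numbers are hypothetical.

```python
from typing import Dict, Sequence


def latent_detection_scores(
    activations: Sequence[float],
    annotated: Sequence[bool],
    tau: float,
) -> Dict[str, float]:
    """Score one latent as a binary detector of one annotation.

    A residue counts as "detected" when the latent activation exceeds tau.
    (Assumption: per-residue scoring with a strict threshold.)
    """
    if len(activations) != len(annotated):
        raise ValueError("activations and annotated must have the same length")

    predicted = [a > tau for a in activations]
    tp = sum(p and y for p, y in zip(predicted, annotated))
    fp = sum(p and not y for p, y in zip(predicted, annotated))
    fn = sum((not p) and y for p, y in zip(predicted, annotated))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (made up for illustration): six residues, three of them annotated.
toy_activations = [0.1, 0.4, 0.9, 0.8, 0.0, 0.6]
toy_annotated = [False, False, True, True, False, True]

low = latent_detection_scores(toy_activations, toy_annotated, tau=0.3)
high = latent_detection_scores(toy_activations, toy_annotated, tau=0.7)
print(low)   # precision 0.75, recall 1.0
print(high)  # precision 1.0, recall 2/3
```

On this toy input, raising the threshold from 0.3 to 0.7 removes the one false positive but also drops one true positive, which is the same precision/recall trade-off the Figure 2 caption describes.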

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Sequence generation procedure. (A) To steer the model outputs, the base Protein Language Model is modified by inserting a sparse autoencoder into the residual stream at a particular layer. During inference, the value of one of the autoencoder's latents is modified. (B) Starting from a random sequence, we run inference with the modified model under intervention and sample a new sequence from the output logits. We repeat this procedure for a fixed number of iterations (e.g., 100) and retain the sequence that yields the highest activation of the target latent $z_k$.
  • Figure 2: Distribution of the number of latent SAE components that detect a feature with a minimum value of precision, recall and F1-score. Setting a higher value for the activation threshold $\tau_k$ significantly increases the precision with which latents detect features, but it decreases the recall.
  • Figure 3: Examples of generated sequences subsequently folded with ESMFold (Lin et al., 2022). The sequences were generated while intervening on the model by increasing the value of latent components associated with the zinc finger motif. With the intervention, the model has a tendency to generate pairs of beta sheets in the vicinity of a helix, as in a typical zinc finger structure.
  • Figure 4: Evolution of the intrinsic dimension estimate through the layers of the ESM-2 8M model. We highlight the layers in the plateau/final ascent region.
  • Figure 5: Cross-entropy increase vs sparsity trade-off for all the vanilla sparse autoencoders trained on layer 3 embeddings from ESM-2 8M. The selected autoencoder is indicated by a dashed circle.
  • ...and 1 more figure
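
The iterative procedure in the Figure 1 caption, part (B), can be sketched as a short loop. This is a hedged illustration rather than the authors' code: the intervened model and the latent read-out are passed in as callables, and the two `toy_*` functions are made-up stand-ins (not ESM-2 and not a trained SAE) that exist only so the loop runs. How the activation of $z_k$ is measured for the final selection is also an assumption here.

```python
import math
import random
from typing import Callable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Logits = List[List[float]]  # one row of 20 logits per sequence position


def sample_from_logits(logits: Logits, rng: random.Random) -> str:
    """Sample one residue per position from softmax(logits)."""
    residues = []
    for row in logits:
        top = max(row)
        weights = [math.exp(x - top) for x in row]  # numerically stable softmax
        residues.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(residues)


def steer(
    intervened_logits: Callable[[str], Logits],
    latent_activation: Callable[[str], float],
    length: int,
    n_iters: int = 100,
    seed: int = 0,
) -> Tuple[str, float]:
    """Iteratively resample a sequence; keep the one with the highest z_k."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_seq, best_act = seq, latent_activation(seq)
    for _ in range(n_iters):
        # Run the model with the latent intervention, then sample a new sequence.
        seq = sample_from_logits(intervened_logits(seq), rng)
        act = latent_activation(seq)
        if act > best_act:
            best_seq, best_act = seq, act
    return best_seq, best_act


# --- Hypothetical stand-ins so the sketch is runnable -----------------------
def toy_intervened_logits(seq: str) -> Logits:
    """Toy 'model': favors C and H, with a small bonus for the current residue."""
    rows = []
    for residue in seq:
        row = [2.0 if aa in "CH" else 0.0 for aa in AMINO_ACIDS]
        row[AMINO_ACIDS.index(residue)] += 1.0
        rows.append(row)
    return rows


def toy_latent_activation(seq: str) -> float:
    """Toy 'latent': fraction of residues that are C or H."""
    return sum(aa in "CH" for aa in seq) / len(seq)


best_seq, best_act = steer(toy_intervened_logits, toy_latent_activation, length=30)
print(best_seq, round(best_act, 3))
```

In a real setting, `intervened_logits` would wrap the language model with the sparse autoencoder inserted at the chosen layer and the target latent set to the desired value, and `latent_activation` would read $z_k$ from the autoencoder's encoder. Keeping both as callables leaves the loop independent of any particular model library.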