Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Charlotte M. Deane

TL;DR

The study tackles interpretability in antibody sequence modeling by applying sparse autoencoders to p-IgGen, a large autoregressive antibody language model. It compares TopK SAEs and Ordered SAEs for identifying human-interpretable latent features and for steering generation toward specific antibody concepts. Key findings show that TopK SAEs reliably map latent features to concepts like CDR identity and IGHJ4 germline identity but offer limited steerability, while Ordered SAEs provide more steerable, high-level controls at the cost of more complex activations. These results advance mechanistic interpretability in domain-specific protein language models and inform strategies for rational, controllable antibody library design.

Abstract

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
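To make the TopK idea in the abstract concrete, the following is a minimal sketch of a TopK SAE forward pass in a commonly used formulation: encode a model activation, keep only the k largest pre-activations per token, and decode back. The function name, the weight shapes, the random weights, and the choice to subtract the decoder bias before encoding are all assumptions for illustration; this summary does not give the architecture or hyperparameters used for p-IgGen.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Illustrative TopK SAE forward pass (not the paper's implementation).

    x:     (n_tokens, d_model) activations from the language model
    W_enc: (d_model, n_latents) encoder weights
    W_dec: (n_latents, d_model) decoder weights
    """
    # Encoder pre-activations, centred on the decoder bias (an assumed convention).
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations for each token.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(pre, idx, axis=-1)
    # Every latent outside the top k is zeroed; kept values pass through a ReLU.
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.maximum(vals, 0.0), axis=-1)
    # Reconstruct the original activation from the sparse code.
    x_hat = z @ W_dec + b_dec
    return z, x_hat

# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
x = rng.normal(size=(5, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `z` has at most `k` non-zero entries, which is what lets an individual latent be inspected against a concept such as CDR identity. An Ordered SAE adds a hierarchy over the latents, which this sketch does not attempt to model.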

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Latent activations (a) and neuron activations (b) for CDRH3 identity, and latent activations for IGHJ3 (c). The x-axis shows the amino-acid sequence of the VH region of a test antibody; the y-axis shows normalised activation. CDRs are coloured CDRH1 (red), CDRH2 (blue), and CDRH3 (green). Latent activations localise to the expected regions—CDRH3 in (a) and the heavy J region in (c)—whereas neuron activations (b) are scattered across the sequence with no discernible pattern.
  • Figure 2: Comparison of absolute positional (a) and IMGT (b) activations of the top three IGHJ4 latents. The x-axis shows the sequence or IMGT position. For the sequence positions, the amino acid sequences were end-padded to a constant length of 350. The y-axis shows the percentage of total activations at each position across validation IGHJ4 sequences. The most frequent IMGT position for activation is highlighted for each latent. When aligned by absolute sequence position, latent activations are distributed over positions near the end of the heavy chain. In contrast, when aligned by IMGT numbering, the latents show discrete activations.
  • Figure 3: Results of IGHJ4 feature steering for latents 463 (a), 4720 (b), and 6276 (c). The y-axis shows the proportion of generated sequences. Plots are coloured by heavy J gene identity. The x-axis shows the steering factor used (alpha). Results are for a library of 1000 p-IgGen-generated sequences. For each latent tested (a-c), steering did not result in a predictable change in library composition.
  • Figure 4: Results of IGHJ4 steering using Ordered latent 12 (a) and 49 (b). Y-axis shows the proportion of generated sequences. Plots are coloured by heavy J gene identity. X-axis shows the steering factor used (alpha). Results are for a library of 1000 p-IgGen-generated sequences. Latent 12—positively correlated with IGHJ4—increases IGHJ4 proportion under positive steering, whereas latent 49—negatively correlated—decreases IGHJ4 under the same steering.
  • Figure 5: IMGT activations of latent 12 (a) and 49 (b). The activation patterns of both latents show a scattered distribution across the range of IMGT positions.
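
The steering experiments in Figures 3 and 4 vary a steering factor (alpha) and report how the heavy J gene composition of the generated library changes. The sketch below shows one common way such steering is done: adding alpha times a latent's unit-normalised decoder direction to the model's hidden state, then tallying gene labels over the generated library. Both helper names are hypothetical, the toy arrays are invented for illustration, and the captions do not say whether the paper adds a decoder direction or clamps the latent activation, so this should be read as an assumption rather than the authors' procedure.

```python
import numpy as np
from collections import Counter

def steer_hidden(h, W_dec, latent_idx, alpha):
    """Shift hidden states along one SAE latent's decoder direction.

    h:     (n_tokens, d_model) hidden states at the layer the SAE was trained on
    W_dec: (n_latents, d_model) SAE decoder weights
    alpha: steering factor; positive pushes towards the latent, negative away
    """
    direction = W_dec[latent_idx]
    # Normalise so alpha has the same scale for every latent (an assumed choice).
    direction = direction / np.linalg.norm(direction)
    return h + alpha * direction

def gene_proportions(labels):
    """Fraction of sequences carrying each gene label (e.g. heavy J gene calls)."""
    counts = Counter(labels)
    total = len(labels)
    return {gene: count / total for gene, count in counts.items()}

# Toy usage: two latents in a 4-dimensional model (made-up numbers).
h = np.zeros((3, 4))
W_dec = np.array([[3.0, 0.0, 4.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
steered = steer_hidden(h, W_dec, latent_idx=0, alpha=2.0)

# Labels would come from annotating generated sequences; these are placeholders.
props = gene_proportions(["IGHJ4", "IGHJ4", "IGHJ6", "IGHJ3"])
```

Repeating the generation and tally for a range of alpha values gives the kind of proportion-versus-alpha curve plotted in Figures 3 and 4. A latent that is only correlated with a concept can leave these proportions essentially unchanged, which is the behaviour reported for the TopK latents.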