Mechanistic Interpretability of Antibody Language Models Using SAEs
Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Charlotte M. Deane
TL;DR
The study addresses interpretability in antibody sequence modeling by applying sparse autoencoders (SAEs) to p-IgGen, an autoregressive antibody language model. It compares TopK SAEs and Ordered SAEs for identifying human-interpretable latent features and for steering generation toward specific antibody concepts. TopK SAEs reliably map latent features to concepts such as CDR identity and IGHJ4 germline identity but offer limited steerability, while Ordered SAEs provide more steerable, high-level controls at the cost of more complex, less interpretable activation patterns. These results advance mechanistic interpretability in domain-specific protein language models and inform strategies for rational, controllable antibody library design.
Abstract
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into the concepts learned by large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and to steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but that high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
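For readers unfamiliar with the TopK variant named above, the following is a minimal sketch of a generic TopK SAE forward pass: encode a model activation, keep only the k largest latent pre-activations, and decode back. It is not the authors' implementation. The function name `topk_sae_forward`, the dimensions, the random weights, and the choice to subtract the decoder bias before encoding are all illustrative assumptions, and nothing here is specific to p-IgGen.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic TopK SAE forward pass (illustrative sketch, not the paper's code)."""
    # Pre-activations for every latent: shape (batch, n_latents).
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only those k values (passed through a ReLU); all others stay zero.
    latents = np.zeros_like(pre)
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Reconstruct the original activation from the sparse latents.
    recon = latents @ W_dec + b_dec
    return latents, recon

# Hypothetical sizes and random weights, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
W_dec = W_enc.T.copy()
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Because at most k latents are active for any input, each active latent can be inspected individually and compared against annotations such as CDR or germline identity, which is the feature-to-concept mapping the abstract describes.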
