Concept Bottleneck Language Models For protein design

Aya Abdelsalam Ismail; Tuomas Oikarinen; Amy Wang; Julius Adebayo; Samuel Stanton; Taylor Joren; Joseph Kleinhenz; Allen Goodman; Héctor Corrada Bravo; Kyunghyun Cho; Nathan C. Frey

TL;DR

This work introduces Concept Bottleneck Language Models (CB-LMs) for protein design. The architecture adds a concept bottleneck layer, an orthogonality constraint, and a linear decoder to enable controllable, interpretable, and debuggable generation. The CB-pLM variant scales from $24\mathrm{M}$ to $3\mathrm{B}$ parameters and encodes over $700$ human-understandable concepts while maintaining perplexity comparable to traditional masked protein language models. It achieves up to $3\times$ stronger concept control and $16\%$ higher intervention accuracy than competing conditional architectures. The approach supports precise single- and multi-concept interventions, retains protein naturalness as measured by TAP metrics, and provides both local and global interpretability through the weights of the final linear layer. A Siltuximab case study and broader design experiments illustrate practical benefits for drug discovery, pointing toward trustworthy and debuggable generative protein models with potential applicability to domains beyond proteins.

Abstract

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model's capabilities through real-world experiments and expert knowledge. We scale CB-pLM from 24 million to 3 billion parameters, making these models the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
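
The control and interpretability claims above both rest on the linear mapping from concept values to token predictions. The following is a toy sketch of that idea only, not the authors' implementation: the concept names, the three-token vocabulary, the weights, and the helper functions `decode`, `intervene`, and `contributions` are all hypothetical, and the sketch leaves out the transformer encoder and the orthogonality constraint entirely.

```python
# Toy sketch: concept values -> linear decoder -> token logits.
# Every name, size, and weight here is an illustrative assumption,
# not taken from the paper.

CONCEPTS = ["hydrophobicity", "charge"]   # hypothetical concept names
TOKENS = ["A", "K", "L"]                  # tiny amino-acid vocabulary

# Linear decoder: one row per token, one column per concept (made-up weights).
W = [
    [0.5, 0.0],    # A
    [-1.0, 2.0],   # K
    [1.5, -0.5],   # L
]
B = [0.0, 0.0, 0.0]


def decode(concepts):
    """Return token logits as a linear function of the concept values."""
    return [sum(w * c for w, c in zip(row, concepts)) + b
            for row, b in zip(W, B)]


def intervene(concepts, name, value):
    """Return a copy of the concept vector with one concept overwritten."""
    edited = list(concepts)
    edited[CONCEPTS.index(name)] = value
    return edited


def contributions(concepts, token):
    """Per-concept additive contribution to one token's logit."""
    row = W[TOKENS.index(token)]
    return {n: w * c for n, w, c in zip(CONCEPTS, row, concepts)}


def argmax_token(logits):
    """Return the token with the highest logit."""
    return TOKENS[max(range(len(logits)), key=logits.__getitem__)]


base = [1.0, 0.0]                          # predicted concept values
edited = intervene(base, "charge", 2.0)    # intervention on one concept

print(argmax_token(decode(base)), argmax_token(decode(edited)))
print(contributions(edited, "K"))
```

Because the decoder is linear, each logit decomposes exactly into per-concept terms, so an intervention's effect on the predicted token can be read directly from the weights. In this toy setup, raising the hypothetical `charge` concept moves the top token from `L` to `K`.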
