Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models

Gal Kesten-Pomeranz, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

TL;DR

The results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.

Abstract

Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work examining the behavior of protein language models (PLMs) in masked-token prediction has shown that they identify repeats. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
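
The second stage described above, where induction heads attend to the token aligned with the masked position in the other repeat instance, can be illustrated with a small sketch. This is not the authors' code and involves no model: the function names, the toy sequence, and the assumption that both repeat instances have the same length and known start positions are all hypothetical, chosen only to show what "aligned token" means for an exact repeat.

```python
def aligned_position(masked_pos, repeat_a_start, repeat_b_start, repeat_len):
    """Return the index in the other repeat instance aligned with masked_pos.

    Hypothetical helper: assumes two equal-length repeat instances whose
    start positions are already known.
    """
    if repeat_a_start <= masked_pos < repeat_a_start + repeat_len:
        return repeat_b_start + (masked_pos - repeat_a_start)
    if repeat_b_start <= masked_pos < repeat_b_start + repeat_len:
        return repeat_a_start + (masked_pos - repeat_b_start)
    raise ValueError("masked position is outside both repeat instances")


def copy_from_aligned(sequence, masked_pos, repeat_a_start, repeat_b_start, repeat_len):
    """Idealized 'induction' prediction: copy the aligned amino acid."""
    target = aligned_position(masked_pos, repeat_a_start, repeat_b_start, repeat_len)
    return sequence[target]


# Toy sequence with an exact repeat "ACDEFG" at positions 2-7 and 10-15.
toy_sequence = "MK" + "ACDEFG" + "LL" + "ACDEFG" + "W"
masked = 12  # position of "D" in the second instance
prediction = copy_from_aligned(toy_sequence, masked, 2, 10, 6)
```

For an approximate repeat, the aligned token may itself be mutated, so pure copying would not always recover the masked residue; that is where the abstract's biologically specialized components, such as neurons encoding amino-acid similarity, would come into play.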

Paper Structure (78 sections, 5 equations, 31 figures, 11 tables)

Figures (31)

  • Figure 1: Visualization of the repeat identification mechanism. The model predicts the masked token by integrating repetition-related (left) and biological features (right). (I) First, relative-position attention heads attend to tokens located at fixed offsets ($\pm n$) from the masked position, followed by the activation of biologically specialized neurons, such as neurons that are selective for biochemically similar amino acids. (II) In middle layers, induction heads attend to the token aligned with the masked position in the other repeat instance and copy its information, enabling retrieval of the correct amino acid, while repetition neurons play an inhibitory role. (III) Finally, MLP neurons refine the final masked token distribution, with amino-acid-biased attention heads also contributing to the prediction.
  • Figure 2: ESM-3 circuit faithfulness scores. Across the three tasks, the discovered circuits achieve high faithfulness (above the 85% threshold) using a small fraction of model components.
  • Figure 3: Cross-Task Circuit Comparisons in ESM-3. We compare the IoU, recall, and cross-task faithfulness of the circuits found for the three repeat tasks. The IoU (left) shows relatively high overlap between all three tasks. The recall (middle) measures the fraction of the ground-truth circuit (x-axis) recovered by the predicted circuit (y-axis), and shows that the synthetic and identical circuits are largely subsumed within each other and within the approximate-repeat circuit. The cross-task faithfulness (right) denotes the faithfulness of the circuit (y-axis) on data from another task variant (x-axis), with scores normalized per target task relative to the faithfulness of its own circuit. The results show that the approximate-repeat circuit is the only one that functionally generalizes to the other settings.
  • Figure 4: Active attention patterns in ESM-3. (a–c) show example attention maps from three attention heads in the approximate-repeat circuit, for a single representative input: (a) fixed relative-position attention (diagonal); (b) induction attention between aligned repeat positions (two partial diagonals); and (c) amino-acid-biased attention (vertical). (d) Circuit attention heads are clustered and visualized using UMAP, colored by pattern type.
  • Figure 5: Circuit faithfulness of subsets of neurons in ESM-3. 3% of MLP neurons per layer are sufficient to recover most circuit faithfulness (80/85%), indicating sparsity in the neuron basis.
  • ...and 26 more figures
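
The circuit comparison metrics named in the Figure 3 caption (IoU, recall, and normalized cross-task faithfulness) can be sketched as simple set and ratio computations. This is a minimal illustration, not the paper's evaluation code: it assumes a circuit can be represented as a set of component identifiers, and the identifiers such as `"L0.H1"`, the function names, and the example numbers are all hypothetical.

```python
def iou(circuit_a, circuit_b):
    """Intersection over union of two circuits given as sets of component ids."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def recall(ground_truth, predicted):
    """Fraction of the ground-truth circuit recovered by the predicted circuit."""
    gt, pred = set(ground_truth), set(predicted)
    return len(gt & pred) / len(gt) if gt else 1.0


def normalized_cross_task_faithfulness(cross_score, own_score):
    """Faithfulness of a circuit on another task's data, divided by the
    faithfulness of that target task's own circuit (assumed normalization)."""
    return cross_score / own_score


# Hypothetical circuits: made-up layer/head and layer/neuron identifiers.
identical_circuit = {"L0.H1", "L4.H2"}
approximate_circuit = {"L0.H1", "L4.H2", "L2.N17", "L5.H0"}

overlap = iou(identical_circuit, approximate_circuit)
identical_in_approx = recall(identical_circuit, approximate_circuit)
approx_in_identical = recall(approximate_circuit, identical_circuit)
```

In this toy example the smaller circuit is fully contained in the larger one, so recall is asymmetric: 1.0 in one direction and 0.5 in the other, while the IoU is 0.5. That asymmetry is the kind of pattern a recall matrix can expose and a symmetric overlap score cannot.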