In-Context Learning can distort the relationship between sequence likelihoods and biological fitness

Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta

TL;DR

This paper shows that in-context learning in biological sequence language models can distort the relationship between sequence likelihood and fitness by enabling look-up retrieval from contextual repeats. Using multiple protein and RNA language models, the authors demonstrate an in-context retrieval mechanism that collapses uncertainty for repeated motifs, sometimes overriding learned priors and degrading embedding quality when repeats are extensive. The work reveals architecture- and data-dependent differences in this phenomenon and discusses implications for interpreting model-based fitness predictions and for building robust sequence design workflows. It highlights the need for careful evaluation of context-driven effects and suggests that retrieval-based distortions could extend to biomolecular structure models as well.

Abstract

Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.

Paper Structure

This paper contains 9 sections, 7 figures.

Figures (7)

  • Figure 1: Proteins with repeated motifs have strikingly low pseudo-perplexity scores. A visualization of the ESM2 OFS pseudo-perplexity distribution of protein domains parsed from a diverse set of protein sequences. We manually scanned through the set of domains that have low pseudo-perplexity scores (left panel: values from 1 to 1.5, shaded in green in the plot). We found that a significant proportion of these low pseudo-perplexity sequences contain repeated motifs. The sequences of five such protein domains are shown in the right panel (green shaded box). The text of each of these five sequences has been wrapped to make the repetitive structure apparent.
  • Figure 2: Repetition can induce an uncertainty collapse in masked language models. Pseudo-perplexity is a measure of the model's uncertainty in its predictions, with a value of 1 signifying complete certainty. (A) Violin plots of the distribution of pseudo-perplexity scores for 1000 protein domains. The 1x distribution in blue denotes natural protein sequences and the 2x distribution in orange denotes doubled sequences. A doubled sequence is generated by appending a copy of a sequence to itself. Transformer-based masked language models (ESM2) exhibit an uncertainty collapse when presented with doubled sequences -- the pseudo-perplexity of the doubled sequences is approximately one, the lowest value that it can take. Progen-M, an autoregressive transformer-based model, exhibits a significant decline in its perplexity for doubled sequences. However, the convolution-based masked language model (CARP) exhibits markedly different behavior. (B) Randomly generated sequences with the same lengths as natural protein domains also exhibit an uncertainty collapse after being doubled in ESM2. Likewise, we observe a sharp decline in the perplexity of these doubled random sequences in Progen-M. (C) CARP (640M) exhibits uncertainty collapse for protein sequences shorter than approximately 70 residues. It also exhibits an uncertainty collapse for randomly generated repeating units of size 20. However, for repeating units of size 100 we only observe a slow, progressive decline as the multiplicity increases. The model is particularly susceptible to an increase in the multiplicity of repeating units of size 70. (D) LC-PLM (1.4B), a BiMamba-S based protein language model, does not exhibit uncertainty collapse, even for short repeating units that vary between 5-9 residues at multiplicities as high as 32x. It only shows a progressive decline in uncertainty with increasing multiplicity of the repeats. The decline in pseudo-perplexity from 4x to 32x units is most pronounced for the shortest repeating unit of 5 residues. The effect gets weaker as the size of the repeating unit increases.
  • Figure 3: In-context retrieval can override learned priors. (A) A depiction of a Calmodulin-binding motif (Q622K8:757-779) that is masked at position 13. The model's prediction for the masked position indicates a preference for residues with large hydrophobic side-chains. This aligns with the identity of the actual residue at that position -- W. (B) Adding a second copy of the domain dramatically reduces the uncertainty of the model. The model is confident that W is the only sensible option for the masked position. (C) Adding a mask at the equivalent position in the second copy of the domain brings back the uncertainty in the model's prediction. (D) Adding a mask at a non-equivalent position does not affect the model's confidence. (E) Scaling the double-masking experiment to several thousand domains establishes that masking the equivalent position is what restores the model's uncertainty, as seen in the entropy distribution of the predictions.
  • Figure 4: (F) Changing the residue at the equivalent position to K also changes the model's prediction correspondingly. The fact that the model is otherwise aware that this residue does not fit in the given context does not deter it from making this prediction. (G) The mask is placed at position 4 in the first copy of the domain. There are two equivalent positions in the second copy that the model can use for retrieval. In this case, the model prefers the residue on the right -- R. (H) The mask is placed at position 21 in the first copy of the domain. There are two equivalent positions that the model can use for retrieval. In this case, the model prefers the residue on the left -- L. (I) We scale the setup described in F to thousands of protein domains (leftmost panel) as well as randomly generated sequences (rightmost panel). Each row of the matrix is the averaged probability vector of the masked position when the equivalent position in the second copy is substituted by the amino acid corresponding to that row. We find that most of the probability mass is concentrated in the diagonal entries of the profile matrix for both protein domains (leftmost panel) and random sequences (rightmost panel). This means that the model's prediction for the masked position skews towards whatever residue is present at the equivalent position in the second copy. For natural protein domains, the probability mass for W and C is markedly lower than for other residues (leftmost panel -- bottom two rows with dull green diagonal entries), indicating that changing the equivalent residue to W or C does not induce as strong a flipping response in the model's prediction for the masked position. We do not observe this behavior for randomly generated sequences (rightmost panel -- all diagonal entries are bright yellow). (J) A plot showing the model's contra-lateral preference for the retrieval position in the setup described in G and H: the model prefers the residue on the right for the retrieval operation when the mask is placed towards the left end of the sequence, and it prefers the residue on the left when the mask is placed towards the right end.
  • Figure 5: In-context learning extends well beyond perfect repeats. (A) A representation of imperfectly repeated sequences. Mutations are added to the appended copy of the protein domain to assess whether the in-context learning effect persists for imperfect repeats. (B) The local ESM2 OFS pseudo-perplexity of the mutated protein domain is significantly lower when it appears alongside the natural domain than when it appears in isolation.
  • ...and 2 more figures
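
The uncertainty collapse in Figure 2 and the retrieval override in Figure 4 (F) can be illustrated with a small toy sketch. It does not use ESM2 or any other model from the paper. `toy_lookup_predict` is a hypothetical stand-in predictor that falls back to a uniform prior and copies a residue only when the masked position's flanking context appears elsewhere in the sequence. The pseudo-perplexity here is the standard one-mask-at-a-time form, which may differ from the OFS variant named in the captions, and the parameters `k` and `confidence` are illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_lookup_predict(seq, i, k=3, confidence=0.99):
    """Hypothetical masked predictor: uniform prior plus context look-up.

    Position i is masked. If the k residues to its left (or right) also
    flank another position j, the residue at j is copied with probability
    `confidence`. Otherwise the prediction is uniform over 20 residues.
    """
    masked = seq[:i] + "?" + seq[i + 1:]  # hide the true residue
    left = masked[i - k:i] if i >= k else None
    right = masked[i + 1:i + 1 + k] if i + k < len(masked) else None
    for j in range(len(masked)):
        if j == i:
            continue
        left_hit = left is not None and j >= k and masked[j - k:j] == left
        right_hit = right is not None and masked[j + 1:j + 1 + k] == right
        if left_hit or right_hit:
            rest = (1.0 - confidence) / (len(AMINO_ACIDS) - 1)
            return {a: (confidence if a == masked[j] else rest)
                    for a in AMINO_ACIDS}
    return {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def pseudo_perplexity(seq, predict=toy_lookup_predict):
    """exp of the mean negative log-probability of each masked residue."""
    nll = [-math.log(predict(seq, i)[seq[i]]) for i in range(len(seq))]
    return math.exp(sum(nll) / len(seq))


# Arbitrary repeat-free toy "domain": each amino acid exactly once.
domain = AMINO_ACIDS
single = pseudo_perplexity(domain)       # no repeats, so uniform prior
doubled = pseudo_perplexity(domain * 2)  # every position can be looked up

# Override: substitute K at the equivalent position in the second copy,
# then mask position i in the first copy.
i = 10
edited = domain + domain[:i] + "K" + domain[i + 1:]
probs = toy_lookup_predict(edited, i)
top = max(probs, key=probs.get)

print(f"1x pseudo-perplexity: {single:.3f}")
print(f"2x pseudo-perplexity: {doubled:.3f}")
print(f"top prediction after substitution: {top}")
```

In this toy setting the single sequence sits at the uniform-prior pseudo-perplexity of 20, while the doubled sequence drops to roughly 1.01, because every masked residue can be retrieved from the other copy. The substitution case predicts K even though the original residue at that position is M, which mirrors the qualitative behavior reported for transformer-based models but says nothing about its magnitude in real models.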