On Recovering Higher-order Interactions from Protein Language Models

Darin Tsui, Amirali Aghazadeh

TL;DR

Interpreting the mutational interactions that drive protein language model predictions is intractable when it requires exhaustively querying the full space of $20^n$ sequences. The authors develop a Walsh-Hadamard Transform-based framework with sparsity $S_5$ and ruggedness metrics, using a $2^n$-sample proxy and a sparse Fourier transform ($q$-SFT) to recover higher-order terms, focusing on sparse regions of the full landscape. Across GFP, TP53, and GB1, ESM2 landscapes exhibit significant higher-order interactions and context-dependent sparsity and ruggedness. On two ten-site GB1 landscapes, $q$-SFT achieves NMSEs of $0.32$ ($R^2=0.66$) and $0.26$ ($R^2=0.72$) using about $7{,}040{,}000$ samples, reducing computational time by roughly a factor of $15{,}000$. This open-box framework enables scalable, interpretable mapping of mutational interactions in protein language models, with potential to inform protein engineering and disease mutation studies.

Abstract

Protein language models leverage evolutionary information to perform state-of-the-art 3D structure and zero-shot variant prediction. Yet extracting and explaining all the mutational interactions that govern model predictions remains difficult, as it requires querying the entire amino acid space for $n$ sites using $20^n$ sequences, which is computationally expensive even for moderate values of $n$ (e.g., $n\sim10$). Although approaches to lower the sample complexity exist, they often limit the interpretability of the model to single and pairwise interactions. Recently, computationally scalable algorithms that rely on the assumption of sparsity in the Fourier domain have emerged to learn interactions from experimental data. However, extracting interactions from language models poses unique challenges: it is unclear whether sparsity is always present, or whether it is the only metric needed to assess the utility of Fourier algorithms. Herein, we develop a framework for systematic Fourier analysis of the protein language model ESM2 applied to three proteins: green fluorescent protein (GFP), tumor protein P53 (TP53), and G domain B1 (GB1). The analysis covers various sites across 228 experiments. We demonstrate that ESM2 is dominated by three regions in the sparsity-ruggedness plane, two of which are better suited for sparse Fourier transforms. Validations on two sample proteins demonstrate recovery of all interactions with $R^2=0.72$ in the more sparse region and $R^2=0.66$ in the more dense region, using only 7 million out of $20^{10}\sim10^{13}$ ESM2 samples, reducing the computational time by a staggering factor of 15,000. All code and data are available in our GitHub repository at https://github.com/amirgroup-codes/InteractionRecovery.
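To make the scale of the exhaustive query concrete, the short Python sketch below works through the arithmetic behind $20^n$. The variable names are illustrative, and the budget of 7 million is the rounded figure quoted in the abstract rather than an exact count.

```python
# Size of the full amino acid space for n mutated sites (20 residues per site).
AMINO_ACIDS = 20

for n in (2, 4, 6, 8, 10):
    print(f"n={n:2d}: {AMINO_ACIDS ** n:.3e} sequences")

# Full space for n = 10, the case validated in the abstract.
n_sites = 10
full_space = AMINO_ACIDS ** n_sites

# Rounded sample budget quoted in the abstract ("7 million").
sample_budget = 7_000_000

# Fraction of the full landscape that the sparse transform actually queries.
fraction_queried = sample_budget / full_space
print(f"fraction of the 20^10 space queried: {fraction_queried:.2e}")
```

The fraction printed here is a ratio of sample counts, which is a different quantity from the reduction in computational time that the abstract reports.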

Paper Structure (7 sections, 2 equations, 4 figures)

Figures (4)

  • Figure 1: Schematic of our ESM2 Fourier analysis framework. ESM2, a protein language model trained for masked language modeling, predicts amino acids at various positions. Using a fixed mutation, we query the entire combinatorial space for selected positions and compute the Fourier transform to recover important interactions.
  • Figure 2: Ruggedness and sparsity across different ESM2 landscapes for GFP, TP53, and GB1. Site selection was based on experimental literature or random sampling from secondary structures and random coils. Our results demonstrate the presence of higher-order interactions and the dependence of sparsity and ruggedness on sequence context, and they identify regions suited to sparse Fourier transforms.
  • Figure 3: Scatter plot of the predicted ESM scores using recovered Fourier coefficients on ten empirical (a) and random coil sites (b) from GB1 over the entire amino acid space. Sparse Fourier transforms recover most interactions with an NMSE of 0.32 ($R^2 = 0.66$) and 0.26 ($R^2 = 0.72$), respectively, highlighting the recovery of interactions in ESM2 in sparse and rugged landscapes.
  • Figure 4: Ruggedness and sparsity across different ESM2 landscapes for GFP, TP53, and GB1 using ESM scores from Meier et al. (2021). While landscapes from Meier et al. (2021) tend to be less rugged and more sparse than landscapes from Brandes et al. (2023), both ESM scores demonstrate the presence of higher-order interactions.
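The idea in Figure 1, querying a full combinatorial landscape and taking its Fourier transform to read off interactions, can be illustrated on a toy binary landscape. The sketch below is a minimal example under stated assumptions, not the authors' implementation: it uses a synthetic function over 4 binary sites in place of ESM2 scores, a plain fast Walsh-Hadamard transform in place of $q$-SFT, and a per-order energy summary that is purely illustrative and is not the paper's sparsity or ruggedness metric. The function name `fwht` and all coefficients are hypothetical.

```python
import numpy as np

def fwht(values):
    """Normalized fast Walsh-Hadamard transform of a length-2^n vector."""
    a = np.asarray(values, dtype=float).copy()
    size = a.size
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"
    h = 1
    while h < size:
        for start in range(0, size, 2 * h):
            top = a[start:start + h].copy()
            bottom = a[start + h:start + 2 * h].copy()
            a[start:start + h] = top + bottom
            a[start + h:start + 2 * h] = top - bottom
        h *= 2
    return a / size

n = 4                                       # number of binary sites (toy size)
idx = np.arange(2 ** n)                     # every sequence, encoded as an integer
bits = (idx[:, None] >> np.arange(n)) & 1   # bits[x, i] is the state of site i
signs = 1 - 2 * bits                        # map {0, 1} to {+1, -1}

# Synthetic landscape (hypothetical coefficients): a constant term, one
# single-site effect, one pairwise term, and one third-order term.
landscape = (
    1.0
    + 0.5 * signs[:, 0]
    - 0.8 * signs[:, 1] * signs[:, 2]
    + 0.3 * signs[:, 0] * signs[:, 2] * signs[:, 3]
)

coeffs = fwht(landscape)

# The set bits of a coefficient index name the interacting sites, so the
# number of set bits is the interaction order.
order = np.array([bin(k).count("1") for k in idx])
nonzero = np.flatnonzero(np.abs(coeffs) > 1e-9)
for k in nonzero:
    sites = [i for i in range(n) if (k >> i) & 1]
    print(f"sites {sites} (order {order[k]}): {coeffs[k]:+.2f}")

# Illustrative summary only: squared coefficient mass at each order.
energy_by_order = np.array([np.sum(coeffs[order == r] ** 2) for r in range(n + 1)])
```

Only 4 of the 16 coefficients are nonzero in this toy landscape, which is the kind of Fourier-domain sparsity that sparse transforms exploit. A dense transform like this one still needs every one of the $2^n$ queries, which is why it does not scale to the $20^{10}$ setting described in the abstract.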