Phylogenetic Corrections and Higher-Order Sequence Statistics in Protein Families: The Potts Model vs MSA Transformer

Kisan Khatri, Ronald M. Levy, Allan Haldane

TL;DR

This work probes whether the physics-inspired Potts model or the data-driven MSA-Transformer (MSA-T) better captures biophysical constraints in protein MSAs when explicit phylogenetic corrections are applied. Using natural MSAs from the RR-domain and Protein Kinase families, the authors assess higher-order statistics via the $r_{20}$ metric and run synthetic tests on 6M-sequence MSAs to examine finite-sample effects, filtering by sequence identity so that the retained sequences approximate i.i.d. samples. After phylogenetic correction, the Potts formulation $P(S \mid \theta_F) \propto \exp\left(-\sum_{i<j} J^{ij}_{s_i s_j}\right)$ yields higher-order statistics that align with the null for well-specified marginals and outperforms MSA-T in detecting biophysical epistatic interactions, while MSA-T struggles on higher-order and connected statistics. The results underscore the importance of handling phylogeny explicitly when evaluating generative protein sequence models (GPSMs) and suggest that the simpler Potts framework encodes biophysical sequence constraints more faithfully, with implications for protein design and interpretation.
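As a minimal sketch of the Potts form quoted above, the snippet below evaluates the pairwise energy $\sum_{i<j} J^{ij}_{s_i s_j}$ and the corresponding unnormalized probability for one integer-encoded sequence. The array layout `J[i, j, a, b]`, the function names, and the random toy couplings are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def potts_energy(seq, J):
    """Sum J[i, j, s_i, s_j] over all position pairs i < j.

    seq: 1-D integer array of length L (residue indices in [0, q)).
    J:   coupling array of shape (L, L, q, q); layout is an assumption.
    """
    seq = np.asarray(seq)
    i_idx, j_idx = np.triu_indices(len(seq), k=1)  # all pairs with i < j
    return float(J[i_idx, j_idx, seq[i_idx], seq[j_idx]].sum())

def unnormalized_prob(seq, J):
    """exp(-E(S)); the partition function is deliberately not computed."""
    return float(np.exp(-potts_energy(seq, J)))

# Toy example: L = 4 positions, q = 3 residue types, random couplings.
rng = np.random.default_rng(0)
L, q = 4, 3
J = rng.normal(size=(L, L, q, q))
seq = np.array([0, 2, 1, 1])
E = potts_energy(seq, J)
```

Only pairs with `i < j` are read, so the lower triangle of `J` is ignored in this sketch.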

Abstract

Recent generative learning models applied to protein multiple sequence alignment (MSA) datasets include simple and interpretable physics-based Potts covariation models and other machine learning models such as MSA-Transformer (MSA-T). The best models accurately reproduce MSA statistics induced by the biophysical constraints within proteins, raising the question of which functional forms best model the underlying physics. The Potts model is usually specified by an effective potential including pairwise residue-residue interaction terms, but it has been suggested that MSA-T can capture the effects induced by effective potentials which include more than pairwise interactions and implicitly account for phylogenetic structure in the MSA. Here we compare the ability of the Potts model and MSA-T to reconstruct higher-order sequence statistics reflecting complex biological sequence constraints. We find that the model performance depends greatly on the treatment of phylogenetic relationships between the sequences, which can induce non-biophysical mutational covariation in MSAs. When using explicit corrections for phylogenetic dependencies, we find the Potts model outperforms MSA-T in detecting epistatic interactions of biophysical origin.

Paper Structure

This paper contains 6 sections and 5 figures.

Figures (5)

  • Figure 1: Phylogenetic relationships between sequences in an MSA result in a spurious mutational correlation due to common ancestry, shown here for the R/V and Q/T combinations at the illustrated column pair. Sequence pairs greater than an identity cut-off (diverging to the left of the gray dotted line) approximate i.i.d. samples, so that identity filtering to retain the 3 sequences labeled in blue gives an unbiased sample showing no correlation, with equal frequencies of the R/V, Q/V and Q/T combinations.
  • Figure 2: Overview of the statistical tests carried out using the Potts model and MSA-T to isolate different forms of statistical error. Boxes represent MSAs with different numbers of sequences, shown for the RR-domain family, which are then filtered, split, or used to train and generate from the GPSMs (arrows). Our tests measure the statistical difference between the "evaluation" MSAs and the corresponding "reference" MSAs.
  • Figure 3: "Natural" GPSM performance test in which the training and reference MSAs are natural sequences filtered by sequence identity to eliminate phylogenetic redundancy, evaluated using the $r_{20}$ metric for (a) the RR-domain family (MSAs of 6K) and (b) the Protein Kinase family (MSAs of 10K).
  • Figure 4: "Synthetic" GPSM performance test for the RR-domain family in which large (6M) training and reference MSAs are produced by an initial GPSM, which is a Potts model in (a),(c) and MSA-T in (b),(d). The $r_{20}$ metric is used in (a),(b), and a cc-$r_{20}$ metric in (c),(d).
  • Figure 5: Tests of the impact of identity filtering for the RR-domain family. (a) Models using 30K unfiltered natural sequences for both training and evaluation. (b) The Potts and Independent models trained on 6K filtered natural sequences; MSA-T was trained on 30K unfiltered sequences. The reference MSA is 30K unfiltered. (c) $r_{20}$ values for all models, evaluated against 6K filtered natural sequences.
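The identity filtering referred to in Figures 1, 3 and 5 can be illustrated with a small greedy sketch: a sequence is kept only if its fractional identity to every already-kept sequence is below a threshold. The greedy order, the function name `identity_filter`, the threshold value, and the toy alignment are all assumptions for illustration; the paper's actual filtering procedure and cut-off are not given in this summary.

```python
import numpy as np

def identity_filter(msa, max_identity):
    """Return indices of sequences kept by a greedy identity filter.

    msa:          2-D array-like of shape (N, L), one residue per cell.
    max_identity: hypothetical cut-off; a sequence is dropped if it matches
                  any kept sequence at a fraction of positions >= this value.
    """
    msa = np.asarray(msa)
    kept = []
    for idx, seq in enumerate(msa):
        # Fraction of aligned positions that are identical to each kept sequence.
        if all(np.mean(seq == msa[k]) < max_identity for k in kept):
            kept.append(idx)
    return kept

# Toy alignment: the second sequence is a close relative of the first.
msa = np.array([list("RVAAK"), list("RVAAR"), list("QTGCK"), list("QVGDE")])
kept = identity_filter(msa, max_identity=0.6)
print(kept)  # prints [0, 2, 3]
```

This pairwise loop scales quadratically with the number of sequences, so it is only meant to show the idea on small alignments.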