Phylogenetic Corrections and Higher-Order Sequence Statistics in Protein Families: The Potts Model vs MSA Transformer
Kisan Khatri, Ronald M. Levy, Allan Haldane
TL;DR
This work asks whether the physics-inspired Potts model or the data-driven MSA-Transformer (MSA-T) better captures biophysical constraints in protein MSAs once explicit phylogenetic corrections are applied. Using natural MSAs from the RR Domain and Protein Kinase families, the authors assess higher-order statistics via the $r_{20}$ metric and run synthetic 6M-sequence tests to examine finite-sample effects, with identity-based filtering to approximate i.i.d. sampling. After phylogenetic correction, the Potts formulation $P(S \mid \theta_F) \propto \exp(-\sum_{i<j} J^{ij}_{s_i s_j})$ yields higher-order statistics that align with the null for well-specified marginals and outperforms MSA-T in detecting biophysical epistatic interactions, while MSA-T struggles on higher-order and connected statistics. The results underscore the importance of handling phylogeny explicitly when evaluating generative protein sequence models (GPSMs) and suggest that the simpler Potts framework encodes biophysical sequence constraints more faithfully, with implications for protein design and interpretation.
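To make the pairwise Potts form above concrete, here is a minimal sketch in Python. It is an illustration, not the authors' implementation. The coupling array layout `J[i, j, a, b]`, the function names, and the toy sizes are all assumptions. The normalization is done by brute-force enumeration, which is only feasible for very short toy sequences.

```python
import itertools
import numpy as np

def potts_energy(seq, J):
    """Sum of couplings J[i, j, s_i, s_j] over all position pairs i < j.

    seq: 1-D integer array of residue indices, length L.
    J:   array of shape (L, L, q, q); only entries with i < j are used
         (this storage layout is an assumption of the sketch).
    """
    i_idx, j_idx = np.triu_indices(len(seq), k=1)  # all pairs with i < j
    return float(J[i_idx, j_idx, seq[i_idx], seq[j_idx]].sum())

def potts_probabilities(seqs, J):
    """P(S) proportional to exp(-E(S)), normalized over the given sequences."""
    energies = np.array([potts_energy(s, J) for s in seqs])
    weights = np.exp(-(energies - energies.min()))  # shift for numerical stability
    return weights / weights.sum()

# Toy example: L = 3 positions, q = 2 residue types, random couplings.
rng = np.random.default_rng(0)
L, q = 3, 2
J = rng.normal(size=(L, L, q, q))
all_seqs = np.array(list(itertools.product(range(q), repeat=L)))
probs = potts_probabilities(all_seqs, J)
```

For realistic protein lengths and a 21-letter alphabet the partition function cannot be enumerated this way, so sequences from a fitted model are typically drawn by sampling (for example MCMC) instead.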
Abstract
Recent generative learning models applied to protein multiple sequence alignment (MSA) datasets include simple and interpretable physics-based Potts covariation models and other machine learning models such as MSA-Transformer (MSA-T). The best models accurately reproduce MSA statistics induced by the biophysical constraints within proteins, raising the question of which functional forms best model the underlying physics. The Potts model is usually specified by an effective potential including pairwise residue-residue interaction terms, but it has been suggested that MSA-T can capture the effects induced by effective potentials which include more than pairwise interactions and implicitly account for phylogenetic structure in the MSA. Here we compare the ability of the Potts model and MSA-T to reconstruct higher-order sequence statistics reflecting complex biological sequence constraints. We find that the model performance depends greatly on the treatment of phylogenetic relationships between the sequences, which can induce non-biophysical mutational covariation in MSAs. When using explicit corrections for phylogenetic dependencies, we find the Potts model outperforms MSA-T in detecting epistatic interactions of biophysical origin.
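One way to compare higher-order sequence statistics between a reference MSA and a model-generated MSA is sketched below in Python. This section does not define the metric, so the procedure is an assumption. For random subsets of `n_positions` columns, it takes the `top` most frequent subsequences ("words") in the reference, computes the Pearson correlation of their frequencies in the two MSAs, and averages over subsets. The function names, defaults, and integer encoding of the MSA are hypothetical.

```python
from collections import Counter
import numpy as np

def word_frequencies(msa, positions):
    """Frequency of each subsequence observed at the given columns."""
    words = [tuple(row) for row in msa[:, positions].tolist()]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def top_word_correlation(ref_msa, model_msa, n_positions,
                         n_subsets=100, top=20, seed=0):
    """Average Pearson correlation of top-word frequencies (assumed definition).

    ref_msa, model_msa: integer arrays of shape (n_sequences, L).
    """
    rng = np.random.default_rng(seed)
    L = ref_msa.shape[1]
    scores = []
    for _ in range(n_subsets):
        positions = np.sort(rng.choice(L, size=n_positions, replace=False))
        ref_f = word_frequencies(ref_msa, positions)
        mod_f = word_frequencies(model_msa, positions)
        # Most frequent words in the reference define the comparison set.
        top_words = sorted(ref_f, key=ref_f.get, reverse=True)[:top]
        x = np.array([ref_f[w] for w in top_words])
        y = np.array([mod_f.get(w, 0.0) for w in top_words])
        if x.std() == 0 or y.std() == 0:
            continue  # Pearson correlation is undefined for constant vectors
        scores.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(scores)) if scores else float("nan")
```

A call such as `top_word_correlation(ref_msa, model_msa, n_positions=4)` would give one point on a curve over subsequence length. Because word frequencies become sparse as `n_positions` grows, finite sample size limits the attainable score, which is one reason large synthetic datasets are useful for this kind of test.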
