
Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention

Jeffrey D. Varner

Abstract

Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed-form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per-position amino acid frequencies.
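The sampler described in the abstract can be sketched in a few lines. The block below is a minimal illustration, not the paper's implementation. It assumes the standard modern Hopfield energy $E(\xi) = \tfrac{1}{2}\|\xi\|^2 - \beta^{-1}\log\sum_k \exp(\beta\,\mathbf{m}_k^\top \xi)$, whose negative gradient is a softmax attention readout minus the current state, and an unadjusted Langevin update targeting $\exp(-\beta E)$. The function names, step size, and step count are hypothetical choices, and the toy patterns stand in for PCA-space encodings of alignment rows.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def hopfield_score(xi, patterns, beta):
    """Negative energy gradient: sum_k p_k * m_k - xi, with p = softmax(beta * <m_k, xi>).

    Assumes the standard modern Hopfield energy (an assumption of this sketch).
    """
    logits = [beta * sum(a * b for a, b in zip(m, xi)) for m in patterns]
    weights = softmax(logits)
    dim = len(xi)
    readout = [sum(w * m[i] for w, m in zip(weights, patterns)) for i in range(dim)]
    return [readout[i] - xi[i] for i in range(dim)]


def langevin_sample(patterns, beta, step=0.01, n_steps=500, seed=0):
    """Run one unadjusted Langevin chain, started from a randomly chosen stored pattern."""
    rng = random.Random(seed)
    xi = list(rng.choice(patterns))
    noise_scale = math.sqrt(2.0 * step / beta)  # noise level for a target of exp(-beta * E)
    for _ in range(n_steps):
        score = hopfield_score(xi, patterns, beta)
        xi = [x + step * s + noise_scale * rng.gauss(0.0, 1.0)
              for x, s in zip(xi, score)]
    return xi


if __name__ == "__main__":
    # Three toy unit-norm patterns in a 3-dimensional space (illustrative only).
    toy_patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(langevin_sample(toy_patterns, beta=5.0))
```

Each step makes a single pass over the stored patterns, which is consistent with the linear cost in alignment size stated in the abstract. Mapping a sampled vector back to an amino acid sequence is outside the scope of this sketch.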

Paper Structure (42 sections, 17 equations, 16 figures, 15 tables, 1 algorithm)

Figures (16)

  • Figure 1: Cross-family comparison of SA generation, SA retrieval, and bootstrap across eight Pfam families. (A) KL divergence of amino acid composition (lower is better). SA generation achieves the lowest or near-lowest KL in every family. (B) Novelty in PCA space ($1 - \max_k \cos(\hat{\xi}, \mathbf{m}_k)$; higher is better). SA generation produces substantially novel sequences, while bootstrap samples, which are exact copies of stored patterns, have zero PCA-space novelty by construction. (C) Nearest sequence identity to a stored pattern (lower indicates greater departure). Error bars show $\pm 1$ SE (30 chains $\times$ 5 samples).
  • Figure 2: Predicting the critical temperature from MSA statistics. (A) Predicted vs. empirical $\beta^*$ for the simple linear model $\beta^* \approx 1.57 + 0.28\sqrt{d}$ ($R^2{=}0.97$, $n{=}33$). Blue circles: eight Pfam families; gold diamonds: WW scaling replicates ($K \in \{20,\ldots,400\}$, five repeats each). (B) $\beta^*$ as a function of $\sqrt{d}$, with the fitted regression (dashed) and the random-pattern prediction $\beta^* = \sqrt{d}$ (dotted red). The intercept offset and reduced slope reflect structured correlations in real protein families. (C) Model comparison (log-scale RMSE). The simple $\sqrt{d}$ model (RMSE${=}0.16$) is $33\times$ more accurate than the naive random-pattern prediction $\beta^* = \sqrt{d}$ (RMSE${=}5.29$); adding MSA features provides negligible improvement.
  • Figure 3: Head-to-head comparison of SA generation against three learned baselines (profile HMM emit, EvoDiff, MSA Transformer) across all eight Pfam families. (A) KL divergence of amino acid composition (pooled over all 150 generated sequences, with bootstrap SE; $n_{\mathrm{boot}}{=}1000$). (B) Novelty in PCA space. (C) Nearest sequence identity. Panels (B) and (C) report per-chain means $\pm 1$ SE (30 chains $\times$ 5 samples). Simple reference baselines (Gaussian perturbation, convex combination, stored-pattern retrieval) are omitted from this figure because their design precludes meaningful comparison: retrieval returns memorized sequences (novelty ${\approx}0$, identity ${\approx}1$) and the perturbation baselines do not attempt to model the family distribution. Their metrics are reported in SI Appendix, Table \ref{tab:results} for completeness.
  • Figure 4: Structure validation of SA-generated sequences using ESMFold. (A) Mean pLDDT (folding confidence) across all eight Pfam families for SA generation, SA retrieval, bootstrap, and stored (natural) sequences. The dashed line indicates the pLDDT${=}70$ confidence threshold. (B) TM-score to a representative experimentally determined structure for each family. The dashed line indicates TM${=}0.5$ (same-fold threshold). Significance brackets show Wilcoxon rank-sum tests comparing SA generation to stored sequences: $^{***}p < 0.001$, $^{**}p < 0.01$, $^{*}p < 0.05$, n.s. = not significant. Error bars show $\pm 1$ SE ($n{=}50$ sequences per condition).
  • Figure 5: AlphaFold2 predicted structures for representative SA-generated sequences (colored by per-residue pLDDT confidence) superimposed on the experimentally determined reference structure for each family (gray, semi-transparent). For each family, the SA-generated sequence with the highest mean pLDDT was selected and aligned to the reference using PyMOL. The close predicted structural agreement (sub-angstrom RMSD) and uniformly high pLDDT scores suggest that SA-generated sequences, despite substantial sequence-level novelty, encode three-dimensional folds that recapitulate the canonical architecture of their target family. Experimental structure determination would be needed for definitive confirmation.
  • ...and 11 more figures
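
Two quantities defined in the captions above are simple enough to compute directly: the PCA-space novelty from Figure 1, $1 - \max_k \cos(\hat{\xi}, \mathbf{m}_k)$, and the fitted critical-temperature model from Figure 2, $\beta^* \approx 1.57 + 0.28\sqrt{d}$. The following is a minimal sketch under stated assumptions: the function names are hypothetical, vectors are plain Python lists assumed to be nonzero, and $d$ is taken to be the PCA dimensionality referred to in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pca_novelty(xi_hat, patterns):
    """Novelty as in the Figure 1 caption: 1 - max_k cos(xi_hat, m_k)."""
    return 1.0 - max(cosine(xi_hat, m) for m in patterns)


def predict_beta_star(d):
    """Fitted linear model from the Figure 2 caption: 1.57 + 0.28 * sqrt(d)."""
    return 1.57 + 0.28 * math.sqrt(d)


if __name__ == "__main__":
    stored = [[1.0, 0.0], [0.0, 1.0]]       # toy stored patterns (illustrative only)
    print(pca_novelty([1.0, 0.0], stored))  # an exact copy of a stored pattern
    print(pca_novelty([1.0, 1.0], stored))  # a vector halfway between the two patterns
    print(predict_beta_star(25))            # hypothetical PCA dimensionality d = 25
```

An exact copy of a stored pattern has cosine similarity 1 to itself, so its novelty is zero, which matches the caption's remark that bootstrap samples have zero PCA-space novelty by construction.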