Higher-Order Modular Attention: Fusing Pairwise and Triadic Interactions for Protein Sequences

Shirin Amiraslani, Xin Gao

Abstract

Transformer self-attention computes pairwise token interactions, yet protein sequence-to-phenotype relationships often involve cooperative dependencies among three or more residues that dot-product attention does not capture explicitly. We introduce Higher-Order Modular Attention (HOMA), a unified attention operator that fuses pairwise attention with an explicit triadic interaction pathway. To make triadic attention practical on long sequences, HOMA employs block-structured, windowed triadic attention. We evaluate on three TAPE benchmarks: Secondary Structure, Fluorescence, and Stability. Our attention mechanism yields consistent improvements across all three tasks compared with standard self-attention and with efficient variants, including block-wise attention and Linformer. These results suggest that explicit triadic terms provide complementary representational capacity for protein sequence prediction at a controllable additional computational cost.
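
The fusion described above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the trilinear score `sum_c Q[i,c] * K[j,c] * U[k,c]`, the averaged pair value `(V[j] + V[k]) / 2`, the linear fusion matrix `Wf`, and all function and parameter names are assumptions chosen for illustration. Only the overall shape (shared projections $Q$, $K$, $V$, $U$; a pairwise and a windowed triadic pathway computed in parallel; concatenation, fusion, and an output projection) follows the description in this document.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def pairwise_attention(Q, K, V):
    # Standard scaled dot-product attention over all token pairs.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

def windowed_triadic_attention(Q, K, U, V, w):
    # ASSUMED triadic form: score(i, j, k) = sum_c Q[i,c] * K[j,c] * U[k,c],
    # with j and k restricted to a window of width w centred on i.
    n, d = Q.shape
    half = w // 2
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        Kw, Uw, Vw = K[lo:hi], U[lo:hi], V[lo:hi]
        scores = np.einsum("c,jc,kc->jk", Q[i], Kw, Uw) / np.sqrt(d)
        # Normalise jointly over all (j, k) pairs in the window.
        attn = softmax(scores.reshape(-1)).reshape(scores.shape)
        # ASSUMED value for a pair (j, k): the average of the two value vectors.
        pair_vals = 0.5 * (Vw[:, None, :] + Vw[None, :, :])
        out[i] = np.einsum("jk,jkc->c", attn, pair_vals)
    return out

def homa_layer(X, params, w=5):
    # Shared projections feed both pathways.
    Q, K, V, U = (X @ params[name] for name in ("Wq", "Wk", "Wv", "Wu"))
    pair = pairwise_attention(Q, K, V)
    tri = windowed_triadic_attention(Q, K, U, V, w)
    # Concatenate, fuse with a (hypothetical) linear map, then output projection.
    fused = np.concatenate([pair, tri], axis=-1) @ params["Wf"]
    return fused @ params["Wo"]

rng = np.random.default_rng(0)
n, d = 12, 8
X = rng.normal(size=(n, d))
params = {name: rng.normal(scale=d ** -0.5, size=(d, d))
          for name in ("Wq", "Wk", "Wv", "Wu", "Wo")}
params["Wf"] = rng.normal(scale=(2 * d) ** -0.5, size=(2 * d, d))
Y = homa_layer(X, params, w=5)
print(Y.shape)  # (12, 8)
```

In this sketch each position scores at most $w^2$ pairs inside its window, so the triadic pathway costs on the order of $n w^2 d$ operations rather than growing cubically with sequence length, which is one way the window size can act as the knob on additional compute.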

Paper Structure (19 sections, 12 equations, 4 figures, 2 tables)

This paper contains 19 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Transformer layer with Higher-Order Modular Attention. HOMA replaces the self-attention sublayer and computes pairwise and triadic multi-head attention in parallel from shared projections $Q$, $K$, $V$, and $U$. Their outputs are concatenated, fused, and mapped through the standard output projection.
  • Figure 2: Illustration of the three TAPE benchmark tasks. (a) Secondary Structure prediction. A per-residue classification task in which each amino acid is assigned one of three structural labels: helix (H), strand (S), or coil (C). The panel shows an example protein segment with predicted per-residue labels indicated below the sequence. (b) Fluorescence prediction. A sequence-level regression task over green fluorescent protein variants. Concentric regions represent increasing Hamming distance, $H_d$, from the parent sequence, where the inner disk contains variants with $H_d \le 3$ and the outer ring contains variants with $H_d \ge 4$. Color encodes measured Fluorescence intensity, ranging from dark to bright. (c) Stability prediction. A sequence-level regression task over designed protein variants, where the target is a continuous thermodynamic Stability score. High-performing sequences from the training rounds serve as reference points, and test sequences are constructed as single-point mutants at Hamming distance one from these top designs. Color encodes the measured Stability score, with darker shades indicating lower Stability.
  • Figure 3: Efficiency and convergence analysis of HOMA and baseline attention mechanisms. (a) Validation loss over the first six training epochs on the Secondary Structure task, comparing all baselines and HOMA variants across window sizes $w \in \{3, 5, 7\}$. (b) Compute-only throughput in token positions per second, measured on Secondary Structure and Fluorescence for all model variants across the three HOMA window sizes and the three pairwise baselines. (c) Peak GPU memory allocation in gigabytes on the same two tasks and the same set of model variants as panel (b).
  • Figure 4: Ablation studies for HOMA on Secondary Structure prediction across CASP12, TS115, and CB513. All bars report three-class accuracy (Q3). (a) Rank ablation at window size $w{=}7$. The rank of the triadic $U$-projection is varied across full rank, rank 128, rank 32, and rank 8 to evaluate the effect of low-rank factorization on predictive accuracy. (b) Pretraining and freezing of pairwise attention at window size $w{=}7$. Three training configurations are compared: HOMA trained from scratch, HOMA initialized from a pretrained pairwise backbone with continued joint optimization, and HOMA initialized from a pretrained pairwise backbone with transferred weights frozen during triadic training. (c) Effect of maximum sequence length at window size $w{=}5$. Training runs are compared under maximum sequence lengths of 256, 512, and 1024 to evaluate the sensitivity of Q3 accuracy to the sequence-length budget.
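
Two quantities used in the captions above, the Hamming distance $H_d$ that separates the Fluorescence variants and the three-class accuracy (Q3) reported for Secondary Structure, can be sketched in a few lines of Python. The helper names, the toy sequences, and the toy label strings are hypothetical and are not taken from the paper or from the TAPE codebase; only the $H_d \le 3$ versus $H_d \ge 4$ split and the three-label setting come from the captions.

```python
def hamming_distance(a, b):
    # Number of positions at which two equal-length sequences differ.
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def split_by_hamming(parent, variants, threshold=3):
    # Variants with H_d <= threshold go to `near`, the rest (H_d >= threshold + 1) to `far`.
    near = [v for v in variants if hamming_distance(parent, v) <= threshold]
    far = [v for v in variants if hamming_distance(parent, v) > threshold]
    return near, far

def q3_accuracy(predicted, true):
    # Fraction of residues whose predicted three-class label matches the true label.
    if len(predicted) != len(true):
        raise ValueError("label strings must have equal length")
    if not true:
        raise ValueError("label strings must be non-empty")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

# Toy parent and variants (not real GFP sequences).
parent = "MKTAY"
variants = ["MKTAY", "MKSAY", "AKSGF"]
near, far = split_by_hamming(parent, variants)
print(near)  # ['MKTAY', 'MKSAY']
print(far)   # ['AKSGF']

# Toy per-residue labels using the H / S / C alphabet from the Figure 2 caption.
print(q3_accuracy("HHSCC", "HHSSC"))  # 0.8
```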