Fast and Interpretable Protein Substructure Alignment via Optimal Transport

Zhiyu Wang, Bingxin Zhou, Jing Wang, Yang Tan, Weishu Zhao, Pietro Liò, Liang Hong

TL;DR

PLASMA introduces an entropy-regularized optimal transport framework for residue-level protein substructure alignment, delivering interpretable alignment matrices and a normalized similarity score via differentiable Sinkhorn iterations. It supports both trainable (PLASMA) and training-free (PLASMA-PF) variants, with a Label Match Loss to guide localization when annotations exist. Across interpolation and extrapolation tasks on diverse backbone representations, PLASMA achieves superior accuracy and efficiency (≈10 ms per protein pair) relative to global and local baselines, while preserving interpretability of alignments. The method enables robust detection and localization of functional motifs across proteins with varying sequences and folds, offering practical value for functional annotation, evolution studies, and structure-guided design.

Abstract

Proteins are essential biological macromolecules that execute life functions. Local motifs within protein structures, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, the first deep learning framework for efficient and interpretable residue-level protein substructure alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at https://github.com/ZW471/PLASMA-Protein-Local-Alignment.git.
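To make the phrase "regularized optimal transport task" with "differentiable Sinkhorn iterations" concrete, here is a minimal sketch of generic entropy-regularized optimal transport between two sets of residues. It is not the official PLASMA implementation: the function name `sinkhorn_plan`, the uniform marginals, the toy cost matrix, and the parameter names `tau` and `n_iters` are all assumptions made for illustration. The actual method learns its cost matrices and may constrain the plan differently.

```python
import math


def sinkhorn_plan(cost, tau=0.1, n_iters=200):
    """Generic entropy-regularized OT with uniform marginals (illustrative sketch).

    cost    : n x m list of lists, pairwise residue costs (hypothetical input)
    tau     : entropic regularization strength (temperature)
    n_iters : number of Sinkhorn scaling iterations
    Returns an n x m transport plan as a list of lists.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)

    # Gibbs kernel: K_ij = exp(-C_ij / tau)
    K = [[math.exp(-c / tau) for c in row] for row in cost]

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Plan: P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost: residue i of one protein is cheapest to match with residue i of the other.
toy_cost = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0]]

sharp_plan = sinkhorn_plan(toy_cost, tau=0.1)     # low temperature: near one-to-one
diffuse_plan = sinkhorn_plan(toy_cost, tau=10.0)  # high temperature: mass spreads out
```

In this sketch, a lower `tau` concentrates the plan on the lowest-cost residue pairs, while a higher `tau` spreads mass more evenly across all pairs. Every step is a composition of smooth operations, which is why iterations of this kind can be embedded in a trainable model.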

Paper Structure

This paper contains 46 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: PLASMA Overview. PLASMA converts residue-level protein embeddings into substructure alignments using optimal transport. A Transport Planner learns cost matrices with Sinkhorn iterations, and a Plan Assessor produces similarity scores. The framework provides alignment matrices and quantitative scores without requiring model-specific designs.
  • Figure 2: Performance versus computational efficiency comparison. ROC-AUC scores plotted against inference time (milliseconds) for motif and binding/active site detection using ProstT5 embeddings. Points represent averages across three splits with standard error bars on both axes.
  • Figure 5: Representative alignment examples across three protein pairs. A, P40343 vs Q8K0L0. B, P64215 vs C0H419. C, Q69ZS8 vs Q86W92. Left: 3D structures with highlighted aligned regions. Center and right: alignment matrices from PLASMA and EBA with zoomed insets.
  • Figure 6: Representative alignment matrices comparing query protein P76129 against six candidate proteins. The visualization shows four positive pairs (POS) with shared substructures and two negative pairs (NEG) without substructure similarity. Orange regions highlight aligned substructures.
  • Figure 7: Effect of Sinkhorn temperature parameter $\tau$ on alignment matrix and score for both PLASMA and PLASMA-PF variants.
  • ...and 9 more figures