Loop-Diffusion: an equivariant diffusion model for designing and scoring protein loops

Kevin Borisiak, Gian Marco Visani, Armita Nourmohammad

TL;DR

Loop-Diffusion is an energy-based diffusion model that leverages a dataset of general protein loops from the entire protein universe to learn an energy function that generalizes to functional prediction tasks. It is evaluated on scoring TCR-pMHC interfaces.

Abstract

Predicting protein functional characteristics from structure remains a central problem in protein science, with broad implications from understanding the mechanisms of disease to designing novel therapeutics. Unfortunately, current machine learning methods are limited by scarce and biased experimental data, and physics-based methods are either too slow to be useful or too simplified to be accurate. In this work, we present Loop-Diffusion, an energy-based diffusion model that leverages a dataset of general protein loops from the entire protein universe to learn an energy function that generalizes to functional prediction tasks. We evaluate Loop-Diffusion's performance on scoring TCR-pMHC interfaces and demonstrate state-of-the-art results in recognizing binding-enhancing mutations.

Paper Structure (11 sections, 5 equations, 5 figures, 2 algorithms)

Figures (5)

  • Figure 1: Schematic of Loop-Diffusion. (A) Loop extraction pipeline. We consider all loops of length between 4 and 20 residues, and atoms in their 10 Å neighborhoods. (B) Loop-Diffusion is an energy-based model trained with the DDPM objective.
  • Figure 2: Effects of peptide and CDR3 mutations on the binding affinity of TCR-pMHC complexes. Predictions of the effect on binding affinity of mutations in peptides (left) and in the CDR3 loops of TCRs (right) are compared across models for (A) TCR-pMHC complexes of the Human MHC-I system (n=52 for peptides, n=37 for CDR3), and (B) all TCR-pMHC complexes in the ATLAS database with a mutation in the peptide or the CDR3 loop (n=55 for both peptides and CDR3s). Within each panel, the left plot shows correlation coefficients between the experimental data and model predictions, with insignificant correlations (p-value > 0.05) indicated in red; the right plot shows ROC curves and the corresponding AUROC for each model when classifying mutations in each set as favorable ($\Delta\Delta G_{binding}\geq0$) or unfavorable ($\Delta\Delta G_{binding}<0$).
  • Figure S1: Distribution of loop lengths in our dataset extracted from CASP12. The peptides in our test dataset range from 9 to 13 residues in length, a range that is relatively well represented in our training dataset. CDR3 loops, however, are typically 12 to 16 residues long, which is a more data-scarce regime. Future work may attempt to crop the CDR3 loop around the mutation to see whether performance improves.
  • Figure S2: Scatterplots of predicted vs. experimental binding $\Delta \Delta G$ for mutations occurring on peptides only (left column) or on one CDR3 only (right column), from the subset of the ATLAS dataset containing only Human MHC Class-I systems.
  • Figure S3: Scatterplots of predicted vs. experimental binding $\Delta \Delta G$ for mutations occurring on peptides only (left column) or on one CDR3 only (right column), from the ATLAS dataset.
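
The two evaluation quantities named in the Figure 2 caption, a correlation coefficient between experimental data and model predictions and an AUROC for separating favorable from unfavorable mutations, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The caption does not say which correlation coefficient is used, so Pearson is an assumption here. The function names, the toy numbers, and the convention that a higher predicted score means a more favorable mutation are also hypothetical. Only the labeling threshold (favorable when $\Delta\Delta G_{binding}\geq0$) is taken from the caption.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (assumed; the caption does not name the coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values for illustration only; these are NOT data from the paper.
experimental_ddg = [0.8, -0.3, 0.1, -1.2, 0.0, -0.5]
predicted_score = [0.6, -0.1, 0.3, -0.9, -0.2, 0.2]

# Labeling threshold from the Figure 2 caption: favorable when ddG >= 0.
is_favorable = [ddg >= 0 for ddg in experimental_ddg]

toy_correlation = pearson(experimental_ddg, predicted_score)
toy_auroc = auroc(predicted_score, is_favorable)
```

The pairwise AUROC loop is quadratic in the number of mutations, which is acceptable at the sample sizes quoted in the caption (n of roughly 37 to 55). A rank-based implementation from an established statistics library would be the usual choice for larger sets.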