A framework for conditional diffusion modelling with applications in motif scaffolding for protein design

Kieran Didi; Francisco Vargas; Simon V Mathis; Vincent Dutordoir; Emile Mathieu; Urszula J Komorowska; Pietro Lio

TL;DR

This work unifies conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform, which makes it possible to draw connections between existing methods and to propose a new variation on existing conditional training protocols.

Abstract

Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate to address this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols have been proposed or imported from the computer vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, which obscures the connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and to propose a new variation on existing conditional training protocols. We illustrate the effectiveness of this new protocol in both image outpainting and motif scaffolding, and find that it outperforms standard methods.

Paper Structure (43 sections, 5 theorems, 30 equations, 5 figures, 3 tables, 9 algorithms)

Introduction
Theory: Conditioning diffusions via the $h$-transform, a new perspective
Doob's $h$-transform with hard constraint
Hard constraint
Reconstruction guidance
Generalised $h$-transform for soft constraints
Amortised training of $h$-transform
Conditional Finetuning - Learning the Generalised $h$-transform in Noisy Inverse Problems
Experimental results
Conditional image generation.
Conditional protein design: motif scaffolding
Data
Diffusion process
Noise model
Methods
...and 28 more sections

Key Result

Proposition 2.1

(Doob's $h$-transform; rogers2000diffusions) Consider the reverse SDE [equation not recovered], where time flows backwards and with transition densities $\overleftarrow{p}$ [...] such that $\mathrm{Law}(\ldots)$ [...]
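As a rough illustration of the idea behind Doob's $h$-transform with a hard constraint, the sketch below conditions a one-dimensional Brownian motion to end at a target value. It is a textbook toy example written in forward time, not the paper's reverse-time setting or its protein model. For Brownian motion the function $h_t(x) = \mathcal{N}(y; x, T - t)$ is known in closed form, so the extra drift $\nabla_x \ln h_t(x) = (y - x)/(T - t)$ can be added directly. The function name `simulate_conditioned_bm` and all of its parameters are hypothetical names chosen for this example.

```python
import numpy as np


def simulate_conditioned_bm(y, x0=0.0, T=1.0, n_steps=1000, n_paths=2000, seed=0):
    """Euler-Maruyama simulation of Brownian motion conditioned on X_T = y.

    Toy h-transform (assumed setup, forward-time convention):
        dH_t = grad_x log h_t(H_t) dt + dW_t,
        h_t(x) = N(y; x, T - t)  =>  grad_x log h_t(x) = (y - x) / (T - t).
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        # Extra drift from the h-transform; it pulls every path towards y
        # and grows stronger as t approaches T.
        grad_log_h = (y - x) / (T - t)
        noise = np.sqrt(dt) * rng.standard_normal(n_paths)
        x = x + grad_log_h * dt + noise
    return x


# All simulated paths should end close to the target, up to discretisation error.
endpoints = simulate_conditioned_bm(y=2.0)
mean_abs_error = np.abs(endpoints - 2.0).mean()
```

In this toy case $h$ is available analytically. In the setting the paper addresses it generally is not, which is why conditional training and sampling protocols approximate or learn the corresponding drift correction instead.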

Figures (5)

Figure 1: Schematic illustration of several common approaches to (conditionally) sample from a diffusion model. The sampling space is partitioned into motif coordinates (vertical) and scaffold coordinates (horizontal). The target motif is marked as $x_\text{motif}^\star$ and regions with plausible scaffolds are illustrated as blue blobs. A clear definition of each approach as a pseudo-algorithm is given in the algorithms appendix.
Figure 2: Some conditional samples.
Figure 3: Conditional protein designs in yellow with target motif 3IXT in blue.
Figure 4: Comparison of our method to RFDiffusion for motif scaffolding on 12 continuous targets. Note that we trained our 4.1M-parameter model for only 4000 epochs ($\sim$300 A100 hours in total), which is significantly less in both compute and parameter count than RFDiffusion ($\sim$26,000+ A100 hours, 59.8M parameters). For the motifs marked with *, we had to shorten the sampled scaffold ranges on both sides of the motif from 0-65 (0-63 for TMRX80) to 0-50, since we trained our version of Genie only on protein generation up to a length of 128 residues. Performance numbers for RFDiffusion are taken from the original publication [watson2023novo], and our designs were created with the same design specifications as described there. We note that our folding step uses ESMFold instead of AlphaFold2; we plan to use AlphaFold2 in future work for a more direct comparison.
Figure 5: Data ablation study on a newly curated SCOPe benchmark dataset with our amortised training model. (a) We utilise the hierarchical structural clustering of SCOPe to create hold-out sets at three different levels of structural hierarchy: the fold, the family and the superfamily level. (b) We test the motif scaffolding performance on these splits and see decreasing scaffolding success for structurally dissimilar samples. (c) The same metrics as in (b), but only for samples that fulfill the definition of in silico success. (d) Scaffolding success by SCOPe class. Alpha helices can be scaffolded successfully, whereas other classes are more challenging.

Theorems & Definitions (7)

Proposition 2.1
Corollary 2.2
Proposition 2.3
Corollary 2.4
Proposition 2.5
proof
proof
