
Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design

Leo Klarner, Tim G. J. Rudner, Garrett M. Morris, Charlotte M. Deane, Yee Whye Teh

TL;DR

Context-guided diffusion (CGD) tackles the challenge of sampling from high-value regions beyond the training data by regularizing the guidance function with unlabeled context data. The method constructs a data- and noise-scale-dependent regularizer that yields high predictive uncertainty and smooth gradients on out-of-distribution (OOD) inputs, guiding the reverse diffusion process toward promising near-OOD regions without altering model architectures or the sampling procedure. Across graph-structured molecular diffusion, equivariant diffusion for materials, and discrete protein sequence diffusion, CGD yields substantial improvements over standard guidance and domain-adaptation baselines, with the largest gains when labeled data are scarce or biased. The work demonstrates CGD's versatility and practical impact for accelerating discovery in chemistry, materials science, and biology, and outlines avenues for extending context-aware guidance through physics-based signals, active context selection, and meta-learning.
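To make the idea of a context-based regularizer concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes a guidance model that outputs a Gaussian mean and log-variance for each unlabeled context point, and penalizes the KL divergence to a broad reference Gaussian whose variance grows with the noise scale. The function names, the zero reference mean, and the `1.0 + noise_scale` variance schedule are all hypothetical choices made for illustration. It covers only the predictive-uncertainty and mean-reversion aspect, not gradient smoothness.

```python
import numpy as np

def context_regularizer(pred_mean, pred_logvar, ref_mean, ref_var):
    """Mean KL( N(pred_mean, exp(pred_logvar)) || N(ref_mean, ref_var) ).

    Hypothetical penalty: it is zero when predictions on context points
    match the reference distribution and grows when the model is
    confidently far from it.
    """
    pred_var = np.exp(pred_logvar)
    kl = 0.5 * (
        np.log(ref_var / pred_var)
        + (pred_var + (pred_mean - ref_mean) ** 2) / ref_var
        - 1.0
    )
    return float(np.mean(kl))

def regularized_loss(nll_labeled, ctx_mean, ctx_logvar, noise_scale, weight=1.0):
    """Supervised loss on labeled data plus the context penalty.

    Assumption: the reference variance increases with the noise scale,
    so predictions on heavily noised context points are pushed toward
    higher uncertainty.
    """
    ref_var = 1.0 + noise_scale  # hypothetical schedule
    reg = context_regularizer(ctx_mean, ctx_logvar, 0.0, ref_var)
    return nll_labeled + weight * reg

# Usage: an overconfident prediction far from the reference mean is
# penalized more than one that already matches the reference.
ctx = np.zeros(4)
matched = regularized_loss(0.5, ctx, ctx, noise_scale=0.0)
overconfident = regularized_loss(0.5, ctx + 3.0, ctx - 4.0, noise_scale=0.0)
print(matched < overconfident)
```

In this sketch the labeled-data loss is passed in as a precomputed scalar, so the penalty can be added to an existing training objective without touching the model or the sampler, which mirrors the plug-and-play framing of the summary above.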

Abstract

Generative models have the potential to accelerate key steps in the discovery of novel molecular therapeutics and materials. Diffusion models have recently emerged as a powerful approach, excelling at unconditional sample generation and, with data-driven guidance, conditional generation within their training domain. Reliably sampling from high-value regions beyond the training data, however, remains an open challenge -- with current methods predominantly focusing on modifying the diffusion process itself. In this paper, we develop context-guided diffusion (CGD), a simple plug-and-play method that leverages unlabeled data and smoothness constraints to improve the out-of-distribution generalization of guided diffusion models. We demonstrate that this approach leads to substantial performance gains across various settings, including continuous, discrete, and graph-structured diffusion processes with applications across drug discovery, materials science, and protein design.

Paper Structure (39 sections, 41 equations, 24 figures, 10 tables)

Figures (24)

  • Figure 1: Guidance models that generalize poorly under distribution shifts can be a major performance bottleneck for property-guided diffusion models. We introduce a guidance model regularizer that improves generalization under distribution shifts and enables context-guided diffusion (CGD). We show that CGD leads to conditional sampling processes that consistently generate novel, high-value molecules (red).
  • Figure 2: Context-guided diffusion leverages unlabeled context data to combine signals from labeled training data with structural information of the broader input domain (left). Specifically, we construct a data- and noise scale-dependent guidance model regularizer that encourages smooth gradients, mean reversion, and high predictive uncertainty on out-of-distribution (OOD) inputs, allowing the conditional denoising process under a context-guided diffusion model to focus on promising near-OOD subsets of chemical and protein sequence space (right).
  • Figure 3: Comparison of the small molecules generated with different guided diffusion models across five distinct protein targets. Objective values ($\uparrow$) are normalized with respect to the highest score of the held-out high-property validation set and averaged across five independent training and sampling runs with different random seeds.
  • Figure 4: Comparison of polycyclic aromatic systems generated with different guidance models across ten independent training and sampling runs. Left: Full distribution of generated objective values ($\downarrow$) ablated over different context sets and guidance scales. Right: UMAP plot (McInnes et al., 2018) of training (upper left) and test set (lower right), as well as samples from guided diffusion models. Validity and novelty are analyzed in the materials appendix and show similar trends.
  • Figure 5: Pareto fronts of samples generated with different regularization schemes, highlighting the trade-off between objective value $(\uparrow)$ and naturalness $(\uparrow)$. As samples move away from the training data and enter an out-of-distribution regime, our method consistently generates sequences with better properties for any given level of naturalness.
  • ...and 19 more figures