Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design
Leo Klarner, Tim G. J. Rudner, Garrett M. Morris, Charlotte M. Deane, Yee Whye Teh
TL;DR
Context-guided diffusion (CGD) addresses the challenge of sampling from high-value regions beyond the training data by regularizing the guidance function with unlabeled context data. The method constructs a data- and noise-scale-dependent regularizer that encourages high predictive uncertainty and smooth gradients on out-of-distribution inputs, steering the reverse diffusion process toward promising near-OOD regions without altering model architectures or the sampling procedure. Across graph-structured molecular diffusion, equivariant diffusion for materials, and discrete protein sequence diffusion, CGD yields substantial improvements over standard guidance and domain-adaptation baselines, with the largest gains when labeled data are scarce or biased. The work demonstrates CGD's versatility for accelerating discovery in chemistry, materials science, and biology, and outlines avenues for extending context-aware guidance through physics-based signals, active context selection, and meta-learning.
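To make the regularization idea concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes a guidance model that outputs a Gaussian predictive mean and variance, and it uses a Gaussian KL penalty toward a broad, noise-scale-dependent target as a stand-in for the regularizer. The function names, the linear variance schedule, and the default target mean of zero are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kl(pred_mean, pred_var, target_mean, target_var):
    """Elementwise KL( N(pred_mean, pred_var) || N(target_mean, target_var) )."""
    return 0.5 * (
        np.log(target_var / pred_var)
        + (pred_var + (pred_mean - target_mean) ** 2) / target_var
        - 1.0
    )

def target_variance(t, base_var=1.0, growth=4.0):
    """Hypothetical schedule: the desired uncertainty grows with noise level t in [0, 1]."""
    return base_var * (1.0 + growth * t)

def context_regularizer(pred_mean, pred_var, t, target_mean=0.0):
    """Average penalty on predictions made for noised, unlabeled context points.

    Assumption: confident predictions far from the target mean are penalized,
    which nudges the guidance model toward high uncertainty away from labeled data.
    """
    kl = gaussian_kl(pred_mean, pred_var, target_mean, target_variance(t))
    return float(np.mean(kl))

def regularized_loss(labeled_nll, pred_mean, pred_var, t, weight=1.0):
    """Supervised loss on labeled data plus the weighted context penalty."""
    return labeled_nll + weight * context_regularizer(pred_mean, pred_var, t, 0.0)

# Usage: two context points at noise levels 0.0 and 0.5.
t = np.array([0.0, 0.5])
calibrated = context_regularizer(np.zeros(2), target_variance(t), t)
overconfident = context_regularizer(np.full(2, 2.0), np.full(2, 0.01), t)
print(calibrated, overconfident)
```

In this sketch the penalty vanishes when the predictions on context points already match the broad target, and grows when the model is confidently far from it. Only the guidance model's training loss changes, which is consistent with the plug-and-play framing above.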
Abstract
Generative models have the potential to accelerate key steps in the discovery of novel molecular therapeutics and materials. Diffusion models have recently emerged as a powerful approach, excelling at unconditional sample generation and, with data-driven guidance, conditional generation within their training domain. Reliably sampling from high-value regions beyond the training data, however, remains an open challenge -- with current methods predominantly focusing on modifying the diffusion process itself. In this paper, we develop context-guided diffusion (CGD), a simple plug-and-play method that leverages unlabeled data and smoothness constraints to improve the out-of-distribution generalization of guided diffusion models. We demonstrate that this approach leads to substantial performance gains across various settings, including continuous, discrete, and graph-structured diffusion processes with applications across drug discovery, materials science, and protein design.
