DEFT: Efficient Fine-Tuning of Diffusion Models by Learning the Generalised $h$-transform
Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula Julia Komorowska, Pietro Lio
TL;DR
DEFT introduces a unified Doob's $h$-transform framework for conditional diffusion models and shows that a small, trainable $h$-transform network can be fine-tuned to learn the conditional score $\nabla_{\bm{x}} \ln p_t(\bm{y}|\bm{x})$ while keeping a large unconditional diffusion model fixed. It provides a simulation-free, score-matching objective and a stochastic-control interpretation, enabling both amortised and zero-shot conditioning strategies. Empirically, DEFT delivers faster conditional sampling and strong perceptual and reconstruction performance across image inpainting, super-resolution, HDR, phase retrieval, non-linear deblurring, CT, and protein motif scaffolding, often outperforming state-of-the-art baselines. The approach is particularly advantageous when direct backpropagation through the large base model is infeasible (e.g., API-restricted settings) and demonstrates substantial speedups and data-efficient conditioning with practical impact for inverse problems and bio-design.
Abstract
Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate for conditional sampling in inverse problems. In many real-world applications, we often have access to large, expensively trained unconditional diffusion models, which we aim to exploit for improving conditional sampling. Most recent approaches are motivated heuristically and lack a unifying framework, obscuring connections between them. Further, they often suffer from issues such as being very sensitive to hyperparameters, being expensive to train or needing access to weights hidden behind a closed API. In this work, we unify conditional training and sampling using the mathematically well-understood Doob's h-transform. This new perspective allows us to unify many existing methods under a common umbrella. Under this framework, we propose DEFT (Doob's h-transform Efficient FineTuning), a new approach for conditional generation that simply fine-tunes a very small network to quickly learn the conditional $h$-transform, while keeping the larger unconditional network unchanged. DEFT is much faster than existing baselines while achieving state-of-the-art performance across a variety of linear and non-linear benchmarks. On image reconstruction tasks, we achieve speedups of up to 1.6$\times$, while having the best perceptual quality on natural images and reconstruction performance on medical images. Further, we also provide initial experiments on protein motif scaffolding and outperform reconstruction guidance methods.
