Don't drop your samples! Coherence-aware training benefits Conditional diffusion
Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard
TL;DR
This work addresses the challenge of training conditional diffusion models when conditioning signals are noisy or misaligned. It introduces Coherence-Aware Diffusion (CAD), which conditions the model on both the target condition $y$ and a coherence score $c \in [0,1]$, training a denoiser $\epsilon_{\theta}(X_t,y,c,t)$ and enabling unconditional generation when coherence is low. The authors also refine classifier-free guidance into a coherence-aware variant (CA-CFG) and evaluate CAD across text, class, and semantic conditioning on COCO, ImageNet, and ADE20K, reporting improved fidelity and adherence to prompts compared to baselines and data-filtering strategies. The results show that CAD yields more realistic and diverse samples that better respect the conditioning, while allowing the coherence level to be chosen at inference time. Overall, CAD offers a practical, scalable approach to leveraging noisy web-scale data for high-quality conditional image synthesis.
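For orientation, the training objective and guidance rule implied by the summary above can be written out as follows. This is a sketch under assumptions, not the authors' exact formulation: the summary only gives the network signature $\epsilon_{\theta}(X_t,y,c,t)$, so the standard noise-prediction loss, the noise schedule $\bar{\alpha}_t$, and the guidance scale $\omega$ are assumed here, and the CA-CFG line is one plausible reading in which $c=0$ plays the role of the unconditional branch.

```latex
% Assumed: standard DDPM noise-prediction loss with the coherence score c as an extra input.
X_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{(x_0, y, c),\, \epsilon,\, t}
  \left\| \epsilon - \epsilon_{\theta}(X_t, y, c, t) \right\|_2^2

% Assumed form of coherence-aware guidance: low coherence (c = 0) replaces the
% dropped-condition branch of classifier-free guidance.
\hat{\epsilon} = \epsilon_{\theta}(X_t, y, 0, t)
  + \omega \left( \epsilon_{\theta}(X_t, y, 1, t) - \epsilon_{\theta}(X_t, y, 0, t) \right)
```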
Abstract
Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.
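The idea of conditioning on both the annotation and its coherence score can be illustrated with a minimal NumPy sketch of one loss evaluation. Everything here is hypothetical and chosen for brevity: `toy_denoiser` is a linear stand-in for the real network, the noise schedule `alpha_bars` is arbitrary, and the standard noise-prediction loss is assumed rather than taken from the paper. The point is only that no sample is dropped: each one is kept and its coherence score `c` is passed to the model as an extra input.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_denoiser(params, x_t, y, c, t):
    """Hypothetical stand-in for eps_theta(x_t, y, c, t).

    A single linear map over the concatenated inputs; a real model would be
    a neural network. The coherence score c is simply one more input feature.
    """
    features = np.concatenate([x_t, y, c[:, None], t[:, None]], axis=1)
    return features @ params

def cad_loss(params, x0, y, c, rng, alpha_bars):
    """Noise-prediction loss on a batch, keeping every sample regardless of c."""
    batch = x0.shape[0]
    t_idx = rng.integers(0, len(alpha_bars), size=batch)  # random timestep per sample
    a = alpha_bars[t_idx][:, None]
    eps = rng.standard_normal(x0.shape)                   # target noise
    x_t = add_noise(x0, eps, a)
    t = t_idx / len(alpha_bars)                           # timestep scaled to [0, 1)
    pred = toy_denoiser(params, x_t, y, c, t)
    return float(np.mean((pred - eps) ** 2))

# Toy batch: 8 samples, 4-dim data, 3-dim condition embedding (all assumed sizes).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))
y = rng.standard_normal((8, 3))
c = rng.uniform(0.0, 1.0, size=8)          # coherence score per sample, in [0, 1]
alpha_bars = np.linspace(0.99, 0.01, 10)   # arbitrary illustrative schedule
params = np.zeros((4 + 3 + 1 + 1, 4))      # untrained linear weights

loss = cad_loss(params, x0, y, c, rng, alpha_bars)
print(f"loss on noisy-label batch: {loss:.4f}")
```

In this setup, a filtering baseline would instead discard the rows of the batch whose `c` falls below some threshold before computing the loss, which is the strategy the abstract argues against.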
