Don't drop your samples! Coherence-aware training benefits Conditional diffusion

Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard

TL;DR

This work addresses the challenge of training conditional diffusion models when conditioning signals are noisy or misaligned. It introduces Coherence-Aware Diffusion (CAD), which conditions the model on both the target condition $y$ and a coherence score $c \in [0,1]$, trains the noise predictor $\epsilon_{\theta}(X_t,y,c,t)$, and falls back to unconditional generation when coherence is low. The authors also refine classifier-free guidance into a coherence-aware variant (CA-CFG) and evaluate CAD with text, class, and semantic conditioning on COCO, ImageNet, and ADE20K, reporting improved fidelity and prompt adherence compared to baselines and data-filtering strategies. The results show that CAD yields more realistic and diverse samples that better respect the conditioning, while letting the coherence level be chosen at inference time. Overall, CAD offers a practical, scalable approach to leveraging noisy web-scale data for high-quality conditional image synthesis.

Abstract

Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.
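
The contrast between discarding low-coherence samples and conditioning on the coherence score can be illustrated with a small sketch. This is not the authors' implementation: the function names (`filter_batch`, `cad_batch`, `noisy_sample`, `cad_loss`), the toy data, the placeholder `zero_model`, and the DDPM-style noising rule are all assumptions made for illustration. The only element taken from the text is the model signature $\epsilon_{\theta}(X_t,y,c,t)$.

```python
import math
import random

def filter_batch(samples, threshold):
    """Baseline 'filtering': drop samples whose coherence is below threshold."""
    return [s for s in samples if s["c"] >= threshold]

def cad_batch(samples):
    """Coherence-aware alternative: keep every sample and pass c to the model."""
    return [(s["x0"], s["y"], s["c"]) for s in samples]

def noisy_sample(x0, alpha_bar, rng):
    """DDPM-style forward noising (an assumption, not stated in the text)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + s * e for x, e in zip(x0, eps)]
    return x_t, eps

def cad_loss(model, x0, y, c, t, alpha_bar, rng):
    """Mean squared error between the true noise and model(x_t, y, c, t)."""
    x_t, eps = noisy_sample(x0, alpha_bar, rng)
    pred = model(x_t, y, c, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)

# Toy data: three (image, label, coherence) records; all values are made up.
samples = [
    {"x0": [0.2, -0.1], "y": "cat", "c": 0.9},
    {"x0": [0.5, 0.3], "y": "dog", "c": 0.4},
    {"x0": [-0.7, 0.1], "y": "car", "c": 0.1},
]

def zero_model(x_t, y, c, t):
    """Placeholder for epsilon_theta; a real model would be a neural network."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
losses = [cad_loss(zero_model, x0, y, c, t=10, alpha_bar=0.5, rng=rng)
          for x0, y, c in cad_batch(samples)]
print(len(filter_batch(samples, 0.5)), len(losses))  # prints "1 3"
```

With a threshold of 0.5, filtering keeps one of the three toy samples, whereas the coherence-aware batch contributes a loss term for all three. The coherence score enters as a model input and is not used to weight or mask the loss.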

Paper Structure (39 sections, 2 theorems, 9 equations, 28 figures, 3 tables)

Key Result

Proposition E.1

Lipschitz continuous conditional neural diffusion models that leverage coherence consistent embeddings for the conditioning are equivalent to unconditional models at low coherence.

Figures (28)

  • Figure 1: Images generated from our model, CAD. Our model showcases high visual quality, aesthetics and prompt following.
  • Figure 2: (a) Images generated while varying the input coherence score between the prompt and the target image. The score ranges from 0 (no coherence) to 1 (maximum coherence). Higher coherence scores tend to yield images that adhere more closely to the prompt. Top prompt: "a raccoon wearing an astronaut suit. The racoon is looking out of the window at a starry night; unreal engine, detailed, digital painting,cinematic,character design by pixar and hayao miyazaki, unreal 5, daz, hyperrealistic, octane render"; bottom prompt: "An armchair in the shape of an avocado". (b) As the coherence increases from 0 to 1, CLIPScore increases and FID decreases.
  • Figure 3: Images generated with a TextRIN trained with different handling of the misalignment between the image and its associated text at training. Compared to doing nothing (baseline), removing misaligned samples (filtering) or weighting the loss (weighted), our Coherence-Aware Diffusion training (CAD) generates more visually pleasing images while better adhering to the prompt.
  • Figure 4: Text RIN Block. Architecture of the proposed Text RIN Block used in CAD. We include a cross attention from the text to the latent branch of the RIN block.
  • Figure 5: Text-to-image generation results. (a) Quantitative results for text-to-image generation. We show that CAD achieves significantly better FID, precision, recall, density and coverage while keeping a similar CLIP score. (b) User study results. For each pair of images, one from our CAD method and one from the baseline, filtered or weighted method, users had to indicate which image has the highest quality and which best adheres to the prompt. (c) FID versus CLIP on the text-to-image task for varying degrees of guidance $\omega$. We show that CAD achieves a significantly better trade-off, with a much lower FID for the same CLIP score.
  • ...and 23 more figures
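
The guidance scale $\omega$ swept in Figure 5(c) can be illustrated with a short sketch of coherence-aware guidance. The exact CA-CFG formula is not given in this summary, so the version below is one plausible reading and should be treated as an assumption: it takes the usual classifier-free guidance extrapolation and uses the prediction at coherence 0 as the unconditional branch, which is consistent with the statement that the model behaves unconditionally at low coherence. The names `ca_cfg`, `c_high`, `c_low`, and `toy_model` are hypothetical.

```python
def ca_cfg(model, x_t, y, t, omega, c_high=1.0, c_low=0.0):
    """Guided noise estimate: eps_low + omega * (eps_high - eps_low).

    Assumed convention: omega = 1 recovers the high-coherence prediction,
    omega = 0 recovers the low-coherence (near-unconditional) prediction.
    """
    eps_low = model(x_t, y, c_low, t)    # stands in for the unconditional branch
    eps_high = model(x_t, y, c_high, t)  # prediction that trusts the condition
    return [lo + omega * (hi - lo) for lo, hi in zip(eps_low, eps_high)]

def toy_model(x_t, y, c, t):
    """Made-up stand-in for epsilon_theta: the output shifts by 2 * c."""
    return [x + 2.0 * c for x in x_t]

guided = ca_cfg(toy_model, [0.5, -1.0], "cat", t=10, omega=3.0)
print(guided)  # prints [6.5, 5.0]
```

Values of `omega` above 1 push the estimate further along the direction separating the high-coherence and low-coherence predictions, which is the knob that trades FID against CLIP score in a guidance sweep. Other guidance conventions (for example a `1 + omega` weighting) would shift the meaning of the scale.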

Theorems & Definitions (4)

  • Definition E.1
  • Proposition E.1
  • proof
  • Corollary E.2