CO3: Contrasting Concepts Compose Better

Debottam Dutta, Jianchong Chen, Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury

TL;DR

The paper addresses semantic misalignment in diffusion-based text-to-image generation for prompts containing multiple concepts. It introduces CO3, a gradient-free, plug-and-play corrective sampling framework that combines a concept-contrastive corrector with Tweedie-denoised space composition to suppress degenerate modes where individual concepts dominate. By employing resampler and corrector variants with closeness-aware weight modulation, CO3 achieves balanced, high-fidelity multi-concept representations and demonstrates improved performance over baselines on SDXL and PixArt-Sigma across BLIP-VQA and ImageReward metrics. The approach is model-agnostic and does not require retraining, offering a practical pathway to more robust compositional generation in diffusion models.

Abstract

We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases: prompts like "a cat and a dog" that sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of re-training, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt. The goal is to steer towards "pure" joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance and robustness, with fewer dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.

Paper Structure

This paper contains 25 sections, 3 theorems, 24 equations, 9 figures, 8 tables, 3 algorithms.

Key Result

Lemma 1

Let $\hat{x}_{tweedie}[{\epsilon}_t^{\lambda, c}]:=x_t - \sigma_t \ {\epsilon}_t^{\lambda, c}$ be the Tweedie mean from the CFG-composed noise $\tilde{{\epsilon}_t}^{\lambda} = {\epsilon}_t^{\phi} + \lambda({\epsilon}_t^{C} - {\epsilon}_t^{\phi})$ for some $\lambda$. Let $\Tilde{\hat{x}}_{tweedie}$ be
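
As a minimal sketch of the two definitions in Lemma 1, the NumPy snippet below composes a CFG noise estimate and maps it to its Tweedie mean. The arrays standing in for $x_t$, ${\epsilon}_t^{\phi}$, and ${\epsilon}_t^{C}$ are random placeholders rather than outputs of a real diffusion model, and the names `cfg_noise` and `tweedie_mean` are illustrative, not taken from the paper's code.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, lam):
    """CFG-composed noise: eps_phi + lam * (eps_C - eps_phi)."""
    return eps_uncond + lam * (eps_cond - eps_uncond)

def tweedie_mean(x_t, eps, sigma_t):
    """Tweedie mean as written in Lemma 1: x_t - sigma_t * eps."""
    return x_t - sigma_t * eps

# Placeholder tensors (assumption: stand-ins for real model outputs).
rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 4))
eps_uncond = rng.standard_normal((4, 4))  # plays the role of eps_t^phi
eps_cond = rng.standard_normal((4, 4))    # plays the role of eps_t^C
sigma_t, lam = 0.5, 7.5                   # arbitrary illustrative values

eps_cfg = cfg_noise(eps_uncond, eps_cond, lam)
x_hat = tweedie_mean(x_t, eps_cfg, sigma_t)

# The Tweedie map is affine in eps, so composing the per-branch
# Tweedie means with the same weights gives the same estimate.
x_hat_uncond = tweedie_mean(x_t, eps_uncond, sigma_t)
x_hat_cond = tweedie_mean(x_t, eps_cond, sigma_t)
x_hat_composed = x_hat_uncond + lam * (x_hat_cond - x_hat_uncond)
```

The final lines illustrate only an algebraic property of the definitions: `x_hat` and `x_hat_composed` agree up to floating-point error, which is one way to see why composing in noise space and composing in Tweedie-denoised space can be related.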

Figures (9)

  • Figure 1: The figure illustrates our hypothesis on mode overlap using a simple 2D toy example. (a) Two modes of the distribution $p_t(x |\textit{"a cat and a dog"})$ (in green contour) have significant overlap with the modes of the individual concept distributions $p_t(x|\textit{"a cat"})$ (in red contour) and $p_t(x|\textit{"a dog"})$ (in orange contour). (b) The proposed corrector distribution $p_t(x |\textit{"a cat and a dog"}) / (p_t(x|\textit{"a cat"}) p_t(x| \textit{"a dog"}))$ suppresses these overlaps, steering the generation away from problematic modes. The arrows indicate the denoising directions.
  • Figure 2: Characterization of Resampler and Corrector steps. Resampling is more powerful at high $t$ while the Corrector improves slowly with more timesteps and saturates.
  • Figure 3: Qualitative comparison of different methods on simpler prompts.
  • Figure 4: Qualitative comparison of CO3 with competing methods on complex prompts.
  • Figure 5: Model-agnostic behavior: Qualitative comparison of generations from the PixART-$\Sigma$ base diffusion model, PixART-$\Sigma$ + CO3, and PixART-$\Sigma$ + Composable Diffusion.
  • ...and 4 more figures
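
The ratio in Figure 1(b) has a simple consequence in score form: the log of a ratio is a difference of logs, so the corrector's score is the joint-prompt score minus the sum of the single-concept scores. The toy sketch below assumes isotropic Gaussians for all three distributions (the figure's joint distribution is multi-modal, so this is a deliberate simplification), and `gaussian_score` and `corrector_score` are hypothetical helper names, not the paper's implementation.

```python
import numpy as np

def gaussian_score(x, mu, var):
    """Score (gradient of log-density) of an isotropic Gaussian N(mu, var * I)."""
    return -(x - mu) / var

def corrector_score(x, joint, concepts):
    """Score of p_joint(x) / prod_i p_i(x): joint score minus concept scores."""
    score = gaussian_score(x, *joint)
    for mu, var in concepts:
        score = score - gaussian_score(x, mu, var)
    return score

# Toy 2D setup (assumption: single-Gaussian stand-ins for each prompt).
joint = (np.array([0.0, 0.0]), 1.0)   # "a cat and a dog"
cat = (np.array([2.0, 0.0]), 4.0)     # "a cat"
dog = (np.array([-2.0, 0.0]), 4.0)    # "a dog"

x = np.array([1.0, 0.0])
s = corrector_score(x, joint, [cat, dog])
# For this symmetric setup the log-ratio is -|x|^2 / 4 + const,
# so analytically the corrector score is -x / 2.
```

In a diffusion sampler the same subtraction would be applied to model-predicted scores or noise estimates rather than closed-form Gaussians; how CO3 weights and schedules these terms is not specified in this summary.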

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof