CoInD: Enabling Logical Compositions in Diffusion Models

Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti

TL;DR

This work tackles the challenge of enabling diffusion models to sample data corresponding to arbitrary logical combinations of independently varying attributes. It demonstrates that standard conditional diffusion models fail to honor conditional independence of attribute marginals, especially under non-uniform or partial training support, leading to degraded fidelity and controllability. The authors introduce CoInD, a monolithic diffusion training objective that explicitly enforces conditional independence by minimizing the Fisher divergence between the joint $p(\bm{X}\mid C)$ and the causal factorization into marginals, with a practical pairwise independence approximation and final loss $\mathcal{L}_{final}=\mathcal{L}_{score}+\lambda\mathcal{L}_{CI}$. Across Colored MNIST, Shapes3D, and CelebA, CoInD yields lower JSD and higher conformity scores for unseen AND/NOT compositions, increases diversity in uncontrolled attributes, and allows fine-grained control via a $\gamma$ parameter, highlighting practical gains for controlled and robust compositional generation; the method also adapts to text-to-image pipelines through fine-tuning.
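The combined objective $\mathcal{L}_{final}=\mathcal{L}_{score}+\lambda\mathcal{L}_{CI}$ can be illustrated with a minimal NumPy sketch. It assumes the pairwise independence term penalizes the residual $s(x\mid c_i,c_j)-s(x\mid c_i)-s(x\mid c_j)+s(x)$, which vanishes when the scores factorize as conditional independence requires. The function names, array shapes, and the value of `lam` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def pairwise_ci_loss(s_joint, s_i, s_j, s_uncond):
    """Mean squared norm of s(x|ci,cj) - s(x|ci) - s(x|cj) + s(x).

    All inputs are score estimates of shape (batch, dim). This residual form
    is an assumption: it is zero when the joint score equals the sum of the
    marginal conditional scores minus the unconditional score.
    """
    residual = s_joint - s_i - s_j + s_uncond
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

def final_loss(score_loss, ci_loss, lam):
    """L_final = L_score + lambda * L_CI, as stated in the TL;DR."""
    return score_loss + lam * ci_loss

# Toy score arrays standing in for network outputs (hypothetical data).
rng = np.random.default_rng(0)
s_i, s_j, s_uncond = rng.normal(size=(3, 8, 4))

# A joint score that satisfies the factorization exactly ...
s_joint_consistent = s_i + s_j - s_uncond
# ... and one that violates it by a constant offset in every dimension.
s_joint_violating = s_joint_consistent + 1.0

ci_ok = pairwise_ci_loss(s_joint_consistent, s_i, s_j, s_uncond)
ci_bad = pairwise_ci_loss(s_joint_violating, s_i, s_j, s_uncond)
total = final_loss(score_loss=0.5, ci_loss=ci_bad, lam=0.1)
```

In this toy setting the penalty is negligible for the consistent joint score and grows with the size of the violation, so $\lambda$ trades off denoising accuracy against how strictly independence is enforced.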

Abstract

How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of attributes' conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. The violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher's divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, which demonstrate significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced in scenarios where current solutions that rely on the assumption of conditionally independent marginals struggle, namely logical compositions involving the NOT operation and settings in which only a subset of compositions is observed during training.
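To make the "composition of conditional marginal distributions" concrete, the sketch below combines per-attribute score estimates for AND and NOT queries. It assumes the attributes are conditionally independent given the sample, so an AND over attributes adds each term $s(x\mid c_i)-s(x)$ to the unconditional score; the NOT rule, which subtracts the same term, follows a common inverse-likelihood convention and is an assumption rather than something this page specifies. The helper `compose_scores` and the example vectors are hypothetical.

```python
import numpy as np

def compose_scores(s_uncond, s_and=(), s_not=()):
    """Combine per-attribute scores into one score for a logical query.

    s_uncond : unconditional score s(x), shape (dim,)
    s_and    : iterable of conditional scores s(x|c_i) for attributes that must hold
    s_not    : iterable of conditional scores s(x|c_j) for attributes to negate
    """
    s_uncond = np.asarray(s_uncond, dtype=float)
    score = s_uncond.copy()
    for s in s_and:
        # AND: add the attribute's contribution relative to the unconditional score.
        score = score + (np.asarray(s, dtype=float) - s_uncond)
    for s in s_not:
        # NOT (assumed convention): subtract that same contribution.
        score = score - (np.asarray(s, dtype=float) - s_uncond)
    return score

# Hypothetical score vectors at a single point x.
s_uncond = np.array([0.2, -0.1])
s_digit = np.array([1.0, 0.5])    # s(x | digit = 4)
s_color = np.array([-0.3, 0.4])   # s(x | color = red)

and_score = compose_scores(s_uncond, s_and=[s_digit, s_color])
and_not_score = compose_scores(s_uncond, s_and=[s_digit], s_not=[s_color])
```

These combinations are only valid if the learned conditional scores actually factorize, which is the assumption the abstract says standard conditional diffusion models violate.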

Paper Structure

This paper contains 44 sections, 61 equations, 18 figures, 6 tables, 1 algorithm.

Figures (18)

  • Figure 1: Generative Modeling of Logical Compositions. (a-c) Consider the task of generating MNIST samples for any logical composition of digits and colors by learning from observational data with different supports. (d) Standard diffusion models fail to generate data with arbitrary logical compositions of attributes. With CoInD, we generate data from simple unseen compositions (row 2) and from more complex logical compositions (rows 3 and 4), even under non-uniform and partial support.
  • Figure 2: (a) $C_1, C_2, \dots, C_n$ vary freely and independently in the underlying causal graph. (b) However, they become dependent during training due to unknown and unobserved confounding factors.
  • Figure 3: Orthogonal partial support
  • Figure 4: Results on Colored MNIST dataset. (a) We compare JSD and CS of CoInD against baselines trained under various settings and on different compositional tasks. (b) Plotting CS against JSD in the log scale of the models trained under different settings reveals a negative correlation.
  • Figure 5: Images generated by CoInD for the logical composition $\text{digit} = 4$ under the non-uniform scenario are significantly more diverse than those generated by the baselines. $H$ is the Shannon entropy.
  • ...and 13 more figures

Theorems & Definitions (1)

  • Definition 1: Support Cover