CoInD: Enabling Logical Compositions in Diffusion Models
Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti
TL;DR
This work addresses how to make diffusion models sample data corresponding to arbitrary logical combinations of independently varying attributes. It shows that standard conditional diffusion models do not preserve the conditional independence of the attribute marginals, especially under non-uniform or partial training support, which degrades fidelity and controllability. The authors introduce CoInD, a training objective for monolithic diffusion models that explicitly enforces conditional independence by minimizing the Fisher divergence between the joint $p(\bm{X}\mid C)$ and its causal factorization into marginals. A pairwise independence approximation makes this practical, and the final loss is $\mathcal{L}_{final}=\mathcal{L}_{score}+\lambda\mathcal{L}_{CI}$. Across Colored MNIST, Shapes3D, and CelebA, CoInD yields lower JSD and higher conformity scores for unseen AND/NOT compositions, increases diversity in uncontrolled attributes, and allows fine-grained control via a $\gamma$ parameter. These results indicate practical gains for controlled and robust compositional generation. The method also adapts to text-to-image pipelines through fine-tuning.
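To make the loss $\mathcal{L}_{final}=\mathcal{L}_{score}+\lambda\mathcal{L}_{CI}$ concrete, the following is a minimal NumPy sketch of one plausible pairwise independence penalty, not the authors' implementation. It assumes two attributes $C_1, C_2$ and takes the penalty to be the mean squared gap between the joint score and the combination of marginal scores implied by conditional independence. The toy Gaussian model, the closed-form scores, and the names `toy_scores`, `ci_penalty`, and `final_loss` are all illustrative assumptions; in practice the scores would come from a trained network evaluated at noisy diffusion steps.

```python
import numpy as np


def mixture_score(x):
    # Score of the mixture 0.5*N(-1, 1) + 0.5*N(+1, 1):
    # d/dx log p(x) = -x + tanh(x).
    return -x + np.tanh(x)


def toy_scores(x, c1, c2):
    """Closed-form scores for a hypothetical toy model (an assumption).

    x[..., 0] ~ N(c1, 1) and x[..., 1] ~ N(c2, 1), with c1 and c2 drawn
    independently and uniformly from {-1, +1}.
    """
    x1, x2 = x[..., 0], x[..., 1]
    s_joint = np.stack([-(x1 - c1), -(x2 - c2)], axis=-1)          # p(x | c1, c2)
    s_c1 = np.stack([-(x1 - c1), mixture_score(x2)], axis=-1)      # p(x | c1)
    s_c2 = np.stack([mixture_score(x1), -(x2 - c2)], axis=-1)      # p(x | c2)
    s_uncond = np.stack([mixture_score(x1), mixture_score(x2)], axis=-1)  # p(x)
    return s_joint, s_c1, s_c2, s_uncond


def ci_penalty(s_joint, s_c1, s_c2, s_uncond):
    # Under conditional independence the joint score equals
    # s_c1 + s_c2 - s_uncond, so the residual below should vanish.
    residual = s_joint - (s_c1 + s_c2 - s_uncond)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))


def final_loss(score_loss, ci_loss, lam):
    # L_final = L_score + lambda * L_CI
    return score_loss + lam * ci_loss


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 2))
scores = toy_scores(x, c1=1.0, c2=-1.0)

# Scores that respect independence give a (numerically) zero penalty.
penalty_ok = ci_penalty(*scores)

# Shifting the joint score mimics a model that violates independence.
s_joint_bad = scores[0] + 0.5
penalty_bad = ci_penalty(s_joint_bad, *scores[1:])
```

In this sketch the penalty is zero exactly when the joint score decomposes into marginal scores, so `lam` trades off the denoising fit against how strictly that decomposition is enforced.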
Abstract
How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of the attributes' conditional marginal distributions, under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. The violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing the Fisher divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, which demonstrate significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is most pronounced in the scenarios where current solutions, which rely on the assumption of conditionally independent marginals, struggle: logical compositions involving the NOT operation, and training in which only a subset of compositions is observed.
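The independence assumption matters because it is what lets a composed distribution be built from marginals. As a worked sketch for the AND of two attributes, assume $C_1$ and $C_2$ are conditionally independent given $\bm{X}$ (the symbols $C_1$, $C_2$ are introduced here for illustration). Bayes' rule then gives:

```latex
\begin{align*}
p(\bm{X}\mid C_1, C_2)
  &= \frac{p(C_1, C_2\mid \bm{X})\,p(\bm{X})}{p(C_1, C_2)}
   = \frac{p(C_1\mid \bm{X})\,p(C_2\mid \bm{X})\,p(\bm{X})}{p(C_1, C_2)}
   \propto \frac{p(\bm{X}\mid C_1)\,p(\bm{X}\mid C_2)}{p(\bm{X})}, \\
\nabla_{\bm{X}}\log p(\bm{X}\mid C_1, C_2)
  &= \nabla_{\bm{X}}\log p(\bm{X}\mid C_1)
   + \nabla_{\bm{X}}\log p(\bm{X}\mid C_2)
   - \nabla_{\bm{X}}\log p(\bm{X}).
\end{align*}
```

If a trained model's conditional scores do not satisfy the second line, sampling with the summed marginal scores no longer targets the intended joint, which is the failure mode the abstract describes.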
