Improving Compositional Generation with Diffusion Models Using Lift Scores
Chenning Yu, Sicun Gao
TL;DR
This work tackles compositional generation in diffusion models by introducing CompLift, a training-free rejection/resampling criterion based on lift scores, which quantify how well a sample aligns with an individual condition. The lift score is approximated from the original diffusion model's own predictions via $\text{lift}(x|c) \approx \mathbb{E}_{t,\epsilon}\{ \|\epsilon - \epsilon_\theta(x_t,\varnothing)\|^2 - \|\epsilon - \epsilon_\theta(x_t,c)\|^2 \}$, so a complex prompt can be decomposed into simpler sub-conditions whose per-condition results are then combined logically (AND/OR/NOT). The authors explore the design space (noise, timesteps, caching) and scale the approach to text-to-image generation, including a pixel-activation scheme in latent space and a variance-reduction technique that substitutes $\epsilon_\theta(\mathbf{z}_t, \mathbf{c}_{\text{compose}})$ for $\epsilon$ to stabilize estimates. Experiments on 2D synthetic data, CLEVR Position, and text-to-image generation show that CompLift improves condition alignment and robustly handles multi-object prompts, with caching substantially reducing inference overhead. Because the method is training-free and model-agnostic, it offers a practical way to improve compositionality in diffusion models without additional training or external modules, leaving extensions (e.g., beyond static images) as future directions.
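To make the lift-score approximation above concrete, the following is a minimal 1D sketch and not the authors' implementation. It assumes a toy setting in which both the conditional and unconditional data distributions are Gaussian, so the optimal noise prediction has a closed form. The names `gaussian_eps` and `lift_score`, the uniform sampling of $\bar{\alpha}_t$, and all numeric values are illustrative assumptions.

```python
import numpy as np

def gaussian_eps(x_t, alpha_bar, mu, sigma):
    """Closed-form E[eps | x_t] when clean data is N(mu, sigma^2).

    Stand-in (assumption) for a trained eps_theta; x_t is defined as
    sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.
    """
    a, b = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    return b * (x_t - a * mu) / (alpha_bar * sigma**2 + 1.0 - alpha_bar)

def lift_score(x, eps_cond, eps_uncond, n_trials=4000, seed=0):
    """Monte Carlo estimate of
    E_{t,eps}[ ||eps - eps(x_t, null)||^2 - ||eps - eps(x_t, c)||^2 ]."""
    rng = np.random.default_rng(seed)
    # Assumption: sample alpha_bar uniformly instead of using a real schedule.
    alpha_bar = rng.uniform(0.01, 0.99, size=n_trials)
    eps = rng.standard_normal(n_trials)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    err_uncond = (eps - eps_uncond(x_t, alpha_bar)) ** 2
    err_cond = (eps - eps_cond(x_t, alpha_bar)) ** 2
    return float(np.mean(err_uncond - err_cond))

# Toy models (assumed): condition c concentrates mass near x = 2,
# while the unconditional distribution is broad and centered at 0.
eps_c = lambda x_t, ab: gaussian_eps(x_t, ab, mu=2.0, sigma=0.5)
eps_0 = lambda x_t, ab: gaussian_eps(x_t, ab, mu=0.0, sigma=2.0)

lift_near = lift_score(2.0, eps_c, eps_0)   # sample consistent with c
lift_far = lift_score(-2.0, eps_c, eps_0)   # sample inconsistent with c
print(lift_near > 0, lift_far < 0)
```

In this toy setting a sample near the conditional mode receives a positive estimate and a sample far from it a negative one, which is the sign pattern a rejection criterion would rely on.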
Abstract
We introduce a novel resampling criterion based on lift scores for improving compositional generation in diffusion models. Using lift scores, we evaluate whether a generated sample aligns with each individual condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We also develop an optimized variant that lowers computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve condition alignment for compositional generation across 2D synthetic data, CLEVR Position tasks, and text-to-image synthesis. Our code is available at http://rainorangelemon.github.io/complift.
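The step of composing per-condition results can be sketched as follows. This is one plausible reading, stated as an assumption and not taken from the authors' code: a condition counts as satisfied when its lift score is positive, AND requires every condition to be satisfied, OR requires at least one, and NOT requires a non-positive score. The function names and the example scores are hypothetical.

```python
import numpy as np

def accept_and(lifts):
    """Accept a sample only if every condition has positive lift."""
    return np.all(lifts > 0, axis=-1)

def accept_or(lifts):
    """Accept a sample if at least one condition has positive lift."""
    return np.any(lifts > 0, axis=-1)

def accept_not(lift):
    """Accept a sample if the negated condition has non-positive lift."""
    return lift <= 0

# Hypothetical lift scores: rows are samples, columns are conditions c1, c2.
lifts = np.array([
    [0.8, 0.3],    # aligned with both conditions
    [0.5, -0.2],   # aligned with c1 only
    [-0.1, -0.4],  # aligned with neither
])

mask_and = accept_and(lifts)                              # c1 AND c2
mask_or = accept_or(lifts)                                # c1 OR c2
mask_c1_not_c2 = (lifts[:, 0] > 0) & accept_not(lifts[:, 1])  # c1 AND NOT c2

# Rejection step: keep only the samples whose mask entry is True.
kept_indices = np.flatnonzero(mask_and)
print(mask_and, mask_or, mask_c1_not_c2, kept_indices)
```

Rejected samples can then be discarded or regenerated, which is what makes the criterion usable without retraining the model.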
