Improving Compositional Generation with Diffusion Models Using Lift Scores

Chenning Yu, Sicun Gao

TL;DR

This work tackles compositional generation in diffusion models by introducing CompLift, a training-free rejection/resampling criterion based on lift scores, which quantify how well a sample aligns with each individual condition. The lift score is approximated from the original diffusion model's own predictions via $\text{lift}(x|c) \approx \mathbb{E}_{t,\epsilon}\left[ \|\epsilon - \epsilon_\theta(x_t,\varnothing)\|^2 - \|\epsilon - \epsilon_\theta(x_t,c)\|^2 \right]$, so a complex prompt can be decomposed into simpler sub-conditions whose per-condition decisions are then combined (AND/OR/NOT). The authors explore the design space (noise, timesteps, caching) and scale the approach to text-to-image generation, including a pixel-activation scheme in latent space and a variance-reduction technique that substitutes $\epsilon_\theta(\mathbf{z}_t, \mathbf{c}_{\text{compose}})$ for $\epsilon$ to stabilize estimates. Experiments on 2D synthetic data, CLEVR Position, and text-to-image generation show that CompLift improves condition alignment and handles multi-object prompts robustly, with caching substantially reducing inference overhead. The method is training-free and model-agnostic, so it can improve compositionality in diffusion models without additional training or external modules. The authors also point to directions for future work, such as extending the approach beyond static images.
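The lift approximation above can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the names `gaussian_eps_model` and `estimate_lift` are hypothetical, the noise schedule is a DDPM-style linear beta schedule, timesteps are sampled uniformly, and the learned network $\epsilon_\theta$ is replaced by the exact noise predictor of a Gaussian data distribution so that the example runs without a trained model.

```python
import numpy as np

T = 1000
# Assumed DDPM-style linear beta schedule; ALPHA_BAR[t] is the cumulative signal level.
ALPHA_BAR = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))


def gaussian_eps_model(mean, std):
    """Exact noise predictor for data ~ N(mean, std^2 I); a stand-in for eps_theta."""
    mean = np.asarray(mean, dtype=float)

    def eps_fn(x_t, alpha_bar):
        # The marginal of x_t is N(sqrt(ab) * mean, (ab * std^2 + 1 - ab) I),
        # and the optimal noise prediction is -sqrt(1 - ab) times its score.
        var_t = alpha_bar * std**2 + (1.0 - alpha_bar)
        return np.sqrt(1.0 - alpha_bar) * (x_t - np.sqrt(alpha_bar) * mean) / var_t

    return eps_fn


def estimate_lift(x, eps_cond, eps_uncond, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_{t,eps}[ |eps - eps_uncond|^2 - |eps - eps_cond|^2 ]."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    t = rng.integers(0, T, size=n_samples)           # uniform timesteps (assumption)
    ab = ALPHA_BAR[t][:, None]                       # shape (n_samples, 1)
    eps = rng.standard_normal((n_samples, x.size))   # one noise draw per trial
    x_t = np.sqrt(ab) * x + np.sqrt(1.0 - ab) * eps  # forward-diffused copies of x
    # The same (t, eps) pair is used for both predictions.
    err_uncond = np.sum((eps - eps_uncond(x_t, ab)) ** 2, axis=1)
    err_cond = np.sum((eps - eps_cond(x_t, ab)) ** 2, axis=1)
    return float(np.mean(err_uncond - err_cond))


# Toy usage: a narrow "conditional" mode at (2, 2) against a broad "unconditional" prior.
cond = gaussian_eps_model([2.0, 2.0], std=0.5)
uncond = gaussian_eps_model([0.0, 0.0], std=3.0)
lift_near = estimate_lift([2.0, 2.0], cond, uncond)    # sample at the conditional mode
lift_far = estimate_lift([-2.0, -2.0], cond, uncond)   # sample far from the mode
```

In this toy setup the estimate is positive for the sample at the conditional mode and negative for the distant one, which matches the intended reading of a lift score: the condition helps denoise samples that it actually explains.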

Abstract

We introduce a novel resampling criterion based on lift scores for improving compositional generation in diffusion models. Using lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve condition alignment for compositional generation across 2D synthetic data, CLEVR Position tasks, and text-to-image synthesis. Our code is available at http://rainorangelemon.github.io/complift.
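The "evaluate each condition, then compose" step can be sketched as a small accept/reject filter. This is a minimal illustration under stated assumptions, not the authors' code: it assumes a sample counts as satisfying a single condition when its lift score is positive, that product (AND) requires all conditions, mixture (OR) requires at least one, and negation (NOT) requires the avoided condition to have non-positive lift. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def accept_product(lifts: Sequence[float]) -> bool:
    """AND: every condition must have positive lift (assumed threshold of 0)."""
    return all(score > 0.0 for score in lifts)


def accept_mixture(lifts: Sequence[float]) -> bool:
    """OR: at least one condition must have positive lift."""
    return any(score > 0.0 for score in lifts)


def accept_negation(lift_keep: float, lift_avoid: float) -> bool:
    """c1 AND NOT c2: positive lift for c1, non-positive lift for c2 (assumed form)."""
    return lift_keep > 0.0 and lift_avoid <= 0.0


def filter_samples(
    scored: Sequence[Tuple[str, Sequence[float]]],
    rule: Callable[[Sequence[float]], bool],
) -> List[str]:
    """Keep the samples whose per-condition lift scores pass the composition rule."""
    return [sample for sample, lifts in scored if rule(lifts)]


# Toy usage with made-up lift scores for two conditions per sample.
scored = [
    ("sample_a", [0.8, 0.4]),    # aligned with both conditions
    ("sample_b", [0.9, -0.3]),   # aligned with the first condition only
    ("sample_c", [-0.2, -0.5]),  # aligned with neither
]
kept_and = filter_samples(scored, accept_product)
kept_or = filter_samples(scored, accept_mixture)
```

Rejected samples would then be discarded or redrawn, which is where the resampling cost comes from.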

Paper Structure

This paper contains 31 sections, 14 equations, 22 figures, 13 tables, 6 algorithms.

Figures (22)

  • Figure 1: An illustration of product, mixture, and negation compositional models, and the improved sampling performance using CompLift. Left to right: component distributions, ground-truth composed distribution, composable diffusion samples, samples accepted by CompLift. Top: product; center: mixture; bottom: negation. $\varnothing$ represents the empty set: no samples are generated or accepted. Each component distribution is trained independently using a 2D score-based diffusion model. Accuracy is evaluated based on whether generated samples fall into the support or within the $3\sigma$ region of the composed distribution (details in the experiments section and the 2D appendix).
  • Figure 2: The accuracy of CompLift with different noise sampling strategies on the 2D synthetic dataset. See the section on the effect of noise.
  • Figure 3: Accuracy of acceptance/rejection over a single sampled timestep for the pretrained model of Liu et al. (2022) on the CLEVR Position dataset. We found that models trained with importance sampling require importance sampling for ELBO estimation.
  • Figure 4: Accepted and rejected SDXL examples using the CompLift criterion. Objects in blue are composed through the given prompts. Here we show the images from the text prompts with the most-improved CLIP scores. See more examples in the text-to-image appendix.
  • Figure 5: Average running time on the 2D synthetic dataset. $T$ indicates the number of trials. Note that our methods can be further optimized to the latency of only 1 forward pass with parallelization, while MCMC methods require sequential computation. See the 2D summary table for a performance comparison of all the methods.
  • ...and 17 more figures