Guidance Free Image Editing via Explicit Conditioning
Mehdi Noroozi, Alberto Gil Ramos, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Abhinav Mehrotra, Sourav Bhattacharya
TL;DR
The paper tackles the high computational cost of Classifier-Free Guidance (CFG) in conditional diffusion models for image editing. It introduces Explicit Conditioning (EC), which conditions the noise distribution of the diffusion process on the inputs: the noise is sampled as $y \sim \mathcal{N}(\mu_\psi(\mathbf{c}), \Sigma_\psi(\mathbf{c}) \mathbf{I})$ and the noisy latent is formed as $z_t = \alpha_t x + \sigma_t y$, which enables guidance-free sampling with a single denoising pass per step. For image editing, EC combines a context encoder based on the Stable Diffusion VAE with a Prompt VAE that maps CLIP embeddings to a latent space; together they produce a sampling distribution conditioned jointly on the input image and the prompt, and optional CLIP tokens improve convergence. On InstructPix2Pix, EC achieves higher Directional CLIP and Visual Similarity scores than CFG with 3 passes while being about 3× faster, a gain in both quality and efficiency. The approach is also given theoretical grounding through diffusion-OT and flow-transport interpretations. The findings suggest that explicit conditioning is a general strategy for reducing inference cost in conditional generative models, and that it could extend to diffusion and flow tasks beyond image editing.
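A minimal sketch of the forward relation $z_t = \alpha_t x + \sigma_t y$ from the TL;DR may make the idea concrete. It is illustrative only and not the authors' implementation: the function names are hypothetical, `mu_c` and `var_c` stand in for the encoder outputs $\mu_\psi(\mathbf{c})$ and $\Sigma_\psi(\mathbf{c})$, $\Sigma_\psi(\mathbf{c})$ is assumed to be a per-element variance, and the schedule values $\alpha_t$, $\sigma_t$ are passed in rather than derived from any particular noise schedule.

```python
import math
import random


def sample_conditioned_noise(mu_c, var_c, rng):
    """Draw y ~ N(mu_c, diag(var_c)) element-wise.

    Assumption: var_c holds variances, so the standard deviation is sqrt(var_c).
    """
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu_c, var_c)]


def forward_latent(x, y, alpha_t, sigma_t):
    """Form the noisy latent z_t = alpha_t * x + sigma_t * y."""
    return [alpha_t * xi + sigma_t * yi for xi, yi in zip(x, y)]


rng = random.Random(0)
x = [1.0, -1.0, 0.5]        # toy clean latent
mu_c = [0.2, 0.0, -0.3]     # hypothetical stand-in for mu_psi(c)
var_c = [0.5, 1.0, 0.25]    # hypothetical stand-in for Sigma_psi(c)

y = sample_conditioned_noise(mu_c, var_c, rng)
z_t = forward_latent(x, y, alpha_t=0.7, sigma_t=0.3)
```

Setting `mu_c` to zeros and `var_c` to ones recovers the usual standard-normal noise, so in this sketch the only difference from an unconditioned forward process is where the noise statistics come from.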
Abstract
Current sampling mechanisms for conditional diffusion models rely mainly on Classifier-Free Guidance (CFG) to generate high-quality images. However, CFG requires several denoising passes at each time step, e.g., up to three passes in image editing tasks, resulting in excessive computational cost. This paper introduces a novel conditioning technique that eases the computational burden of well-established guidance techniques, thereby significantly reducing the inference time of diffusion models. To achieve this, we present Explicit Conditioning (EC) of the noise distribution on the input modalities. Intuitively, we model the noise so that it guides the conditional diffusion model during the diffusion process. We present evaluations on image editing tasks and demonstrate that EC outperforms CFG in generating diverse, high-quality images with significantly less computation.
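To illustrate where the three passes come from, the sketch below contrasts one CFG step for image editing with a single-pass step. The three-term combination follows the guidance rule commonly used with InstructPix2Pix (separate image and text scales); it is not spelled out in this abstract, and the toy `CountingDenoiser`, the function names, and the scale values are hypothetical stand-ins used only to count network evaluations.

```python
class CountingDenoiser:
    """Toy scalar 'denoiser' that records how many times it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, z, image_cond, text_cond):
        self.calls += 1
        # Arbitrary toy response: each active condition shifts the prediction.
        shift = (1.0 if image_cond is not None else 0.0) + (
            2.0 if text_cond is not None else 0.0
        )
        return z + shift


def cfg_edit_step(denoiser, z, image_cond, text_cond, s_img, s_txt):
    """One CFG step with image and text guidance: three denoiser passes."""
    e_uncond = denoiser(z, None, None)            # pass 1: no conditioning
    e_img = denoiser(z, image_cond, None)         # pass 2: image only
    e_full = denoiser(z, image_cond, text_cond)   # pass 3: image and text
    return e_uncond + s_img * (e_img - e_uncond) + s_txt * (e_full - e_img)


def single_pass_step(denoiser, z, image_cond, text_cond):
    """One guidance-free step: a single denoiser pass."""
    return denoiser(z, image_cond, text_cond)


cfg_model = CountingDenoiser()
cfg_edit_step(cfg_model, 0.0, "image", "prompt", s_img=1.5, s_txt=7.5)

ec_model = CountingDenoiser()
single_pass_step(ec_model, 0.0, "image", "prompt")

print(cfg_model.calls, ec_model.calls)
```

Since the denoiser dominates the per-step cost, dropping from three evaluations to one is consistent with the roughly 3× speedup reported in the TL;DR.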
