Guidance Free Image Editing via Explicit Conditioning

Mehdi Noroozi, Alberto Gil Ramos, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Abhinav Mehrotra, Sourav Bhattacharya

TL;DR

The paper tackles the high computational cost of Classifier-Free Guidance (CFG) in conditional diffusion models for image editing. It introduces Explicit Conditioning (EC), which encodes the conditioning information directly into the noise distribution of the diffusion process: it samples $y \sim \mathcal{N}(\mu_\psi(\mathbf{c}), \Sigma_\psi(\mathbf{c}) \mathbf{I})$ and forms $z_t = \alpha_t x + \sigma_t y$, which enables guidance-free, single-pass sampling. For image editing, EC combines a context encoder based on the Stable Diffusion VAE with a Prompt VAE (which maps CLIP embeddings to a latent space) to produce a jointly conditioned sampling distribution; CLIP tokens can optionally be kept as an input to improve convergence. On InstructPix2Pix, EC achieves higher Directional CLIP and Visual Similarity scores than CFG with 3 passes while being about 3× faster, a gain in both quality and efficiency. The approach is also given theoretical grounding via diffusion-OT and flow-transport interpretations. The findings suggest that explicit conditioning is a general strategy for reducing inference cost in conditional generative models and could extend to diffusion and flow tasks beyond image editing.
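The sampling construction described above can be made concrete with a small sketch. This is a minimal illustration in plain Python, not the authors' implementation: `mu` and `var` stand in for the encoder outputs $\mu_\psi(\mathbf{c})$ and $\Sigma_\psi(\mathbf{c})$, the toy values and function names are hypothetical, and treating the variance as a per-element (diagonal) vector is an assumption.

```python
import math
import random


def sample_conditioned_noise(mu, var, rng):
    """Draw y ~ N(mu, diag(var)), one independent Gaussian per element."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def noised_latent(x, y, alpha_t, sigma_t):
    """Interpolate between data x and conditioned noise y: z_t = alpha_t*x + sigma_t*y."""
    return [alpha_t * xi + sigma_t * yi for xi, yi in zip(x, y)]


rng = random.Random(0)

# Hypothetical encoder outputs for some condition c (assumed values).
mu = [0.5, -1.0, 2.0]
var = [0.1, 0.2, 0.05]

x = [1.0, 0.0, -1.0]                      # toy "clean" latent
y = sample_conditioned_noise(mu, var, rng)

z_data_end = noised_latent(x, y, alpha_t=1.0, sigma_t=0.0)   # pure data
z_noise_end = noised_latent(x, y, alpha_t=0.0, sigma_t=1.0)  # pure conditioned noise
z_mid = noised_latent(x, y, alpha_t=0.5, sigma_t=0.5)        # somewhere in between
```

Assuming a schedule that reaches these endpoint values, the latent is the clean sample `x` at one end and the conditioned noise `y` at the other, so a sampler would start from a draw that already carries the conditioning instead of from a standard normal.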

Abstract

Current sampling mechanisms for conditional diffusion models rely mainly on Classifier-Free Guidance (CFG) to generate high-quality images. However, CFG requires several denoising passes at each time step, e.g., up to three passes in image editing tasks, resulting in excessive computational costs. This paper introduces a novel conditioning technique to ease the computational burden of the well-established guidance techniques, thereby significantly improving the inference time of diffusion models. We present Explicit Conditioning (EC) of the noise distribution on the input modalities to achieve this. Intuitively, we model the noise to guide the conditional diffusion model during the diffusion process. We present evaluations on image editing tasks and demonstrate that EC outperforms CFG in generating diverse high-quality images with significantly reduced computations.

Paper Structure

This paper contains 25 sections, 21 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Image editing performance for the context image in (a) with the instruction prompt "Make her a bride". CFG results (Eq. \ref{eq:cfg implicit}) are shown with $\times 1$ pass ($s_I=1.0, s_P=1.0$) in (d), $\times 2$ passes ($s_I=1.0, s_P=7.5$) in (e), and $\times 3$ passes ($s_I=1.6, s_P=7.5$) in (f). Our proposed explicit conditioning result with a single pass is shown in (c).
  • Figure 2: Explicit conditioning training: We obtain the mean and variance of the context image via the SD encoder, shown in blue. For the instruction prompt, we use our prompt VAE encoder, shown in yellow, which has the same latent dimension as SD, i.e., $4\times 64 \times 64$. It takes the pooled CLIP embeddings, shown in red, as input and maps them to the corresponding mean and variance. The context and prompt means and variances are fused to form a Gaussian used for sampling in the diffusion process, as in Eq. \ref{eq: y sim context and prompt}. We keep the full $77$ CLIP embeddings, shown in purple, as input to the UNet for the sake of consistency with the internalization.
  • Figure 3: Inference: We call the denoising model recursively, starting from a point sampled from a Gaussian whose mean and variance are a fusion of the context image and the instruction prompt.
  • Figure 4: Qualitative evaluations, failure cases. (a) input context image, (b) CFG $\times 3$ passes, i.e., $s_I=1.6, s_P=7.5$, (c) our proposed $\times 1$ pass explicit conditioning.
  • Figure 5: Qualitative evaluations. (a) input context image, (b) CFG $\times 1$ pass, i.e., $s_I=1.0, s_P=1.0$, (c) CFG $\times 3$ passes, i.e., $s_I=1.6, s_P=7.5$, (d) our proposed $\times 1$ pass explicit conditioning. Our method outperforms CFG while being $3\times$ faster.
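
The pass counts quoted in the captions follow from how CFG combines denoiser outputs for two conditions. The sketch below assumes the InstructPix2Pix form of CFG, $\tilde{e} = e(\varnothing,\varnothing) + s_I\,(e(c_I,\varnothing) - e(\varnothing,\varnothing)) + s_P\,(e(c_I,c_P) - e(c_I,\varnothing))$; the CFG equation the captions reference is not reproduced in this section, so treat the formula as an assumption. The function names are hypothetical.

```python
def cfg_combine(eps_uncond, eps_img, eps_full, s_i, s_p):
    """Combine three denoiser outputs with image scale s_i and prompt scale s_p.

    eps_uncond: output with no conditioning
    eps_img:    output conditioned on the context image only
    eps_full:   output conditioned on both image and prompt
    """
    return [
        u + s_i * (i - u) + s_p * (f - i)
        for u, i, f in zip(eps_uncond, eps_img, eps_full)
    ]


def passes_needed(s_i, s_p):
    """Count the denoiser outputs that have a non-zero weight.

    Expanding the combination gives weights (1 - s_i) on eps_uncond,
    (s_i - s_p) on eps_img and s_p on eps_full; an output with zero
    weight does not have to be computed.
    """
    weights = (1.0 - s_i, s_i - s_p, s_p)
    return sum(1 for w in weights if w != 0.0)


# Guidance settings quoted in the figure captions.
settings = [(1.0, 1.0), (1.0, 7.5), (1.6, 7.5)]
pass_counts = [passes_needed(s_i, s_p) for s_i, s_p in settings]
```

Under this assumed formula, the three caption settings need 1, 2, and 3 denoiser evaluations per step respectively, which is the cost that a single-pass method avoids.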