PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

Feng Tian, Yixuan Li, Yichao Yan, Shanyan Guan, Yanhao Ge, Xiaokang Yang

TL;DR

PostEdit tackles three core image editing challenges: controllability, background preservation, and efficiency. It reframes editing as posterior sampling in diffusion models, using a measurement term that encodes the initial image features to steer sampling toward the target prompt while preserving unedited regions. The method is inversion-free and training-free, combines latent-space optimization via Langevin dynamics with a weighted fusion with the input latent, and runs inference in about 1.5 seconds while reporting state-of-the-art editing performance on PIE-Bench. This makes it practical for interactive image editing, giving high-quality edits with strong background fidelity across diverse scenes.

Abstract

In the field of image editing, three core challenges persist: controllability, background preservation, and efficiency. Inversion-based methods rely on time-consuming optimization to preserve the features of the initial images, which results in low efficiency due to the requirement for extensive network inference. Conversely, inversion-free methods lack theoretical support for background similarity, as they circumvent the issue of maintaining initial features to achieve efficiency. As a consequence, none of these methods can achieve both high efficiency and background consistency. To tackle the challenges and the aforementioned disadvantages, we introduce PostEdit, a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a corresponding measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated by the given target prompt. Extensive experimental results indicate that the proposed PostEdit achieves state-of-the-art editing performance while accurately preserving unedited regions. Furthermore, the method is both inversion- and training-free, necessitating approximately 1.5 seconds and 18 GB of GPU memory to generate high-quality results.
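
For orientation, the posterior-sampling idea mentioned in the abstract can be written in its generic textbook form. The block below shows only the standard Bayes decomposition of the posterior score and a standard Langevin update. The step size $\eta$ and the iteration index $k$ are symbols introduced here for illustration, and this is not claimed to be the paper's exact formulation.

```latex
% Bayes' rule for the score: prior term plus measurement (likelihood) term
\nabla_{\boldsymbol{z}} \log p(\boldsymbol{z} \mid \boldsymbol{y})
  = \nabla_{\boldsymbol{z}} \log p(\boldsymbol{z})
  + \nabla_{\boldsymbol{z}} \log p(\boldsymbol{y} \mid \boldsymbol{z})

% Generic (unadjusted) Langevin update targeting p(z | y)
\boldsymbol{z}^{(k+1)}
  = \boldsymbol{z}^{(k)}
  + \eta \, \nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}^{(k)} \mid \boldsymbol{y})
  + \sqrt{2\eta} \, \boldsymbol{\epsilon}^{(k)},
  \qquad \boldsymbol{\epsilon}^{(k)} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})
```

In this reading, the measurement $\boldsymbol{y}$ carries the initial image features, so the likelihood term pulls samples toward the input while the prior term follows the target prompt.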

Paper Structure

This paper contains 47 sections, 2 theorems, 27 equations, 19 figures, 8 tables, 2 algorithms.

Key Result

Proposition 1

The weighted relationship between the estimated $\boldsymbol{\hat{z}}_0$ and the initial image latent $\boldsymbol{z}_{in}$, used to correct the evaluated $\boldsymbol{z}_0$, is defined with a weight $w$ $\left(0\le w \le 0.1\right)$, where $w$ is a constant that governs the intensity of the injected features.
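
The exact formula of Proposition 1 is not reproduced on this page, so the following is only a minimal sketch under a stated assumption: that the fusion is a convex combination $(1-w)\,\boldsymbol{\hat{z}}_0 + w\,\boldsymbol{z}_{in}$. The function name `fuse_latents` and the default `w=0.05` are hypothetical, and latents are represented as flat Python lists for simplicity.

```python
def fuse_latents(z_hat0, z_in, w=0.05):
    """Blend an estimated latent with the input latent.

    ASSUMPTION: a convex combination (1 - w) * z_hat0 + w * z_in.
    The source only states that 0 <= w <= 0.1 controls how strongly
    the input features are injected; the exact form may differ.
    """
    if not 0.0 <= w <= 0.1:
        raise ValueError("w is expected to lie in [0, 0.1]")
    if len(z_hat0) != len(z_in):
        raise ValueError("latents must have the same length")
    # Element-wise weighted sum of the two latents.
    return [(1.0 - w) * a + w * b for a, b in zip(z_hat0, z_in)]
```

With `w = 0` the estimate is returned unchanged, and larger values of `w` pull the result toward the input latent, which matches the described role of $w$ as an injection strength.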

Figures (19)

  • Figure 1: Different Image Editing Schemes. The inversion-based method, illustrated in the top-left section, involves adding noise from a pre-trained network to a clean image. It then denoises the image based on a target prompt, though it requires time-consuming tuning to ensure background preservation. The top-right section discusses training-based, inversion-free methods, which train a learnable model to achieve satisfactory results but have limited generalization capabilities. Our approach, outlined in the bottom section, is both inversion-free and training-free.
  • Figure 2: Method Overview. The latent representation of the initial image $\boldsymbol{x}_0$ is $\boldsymbol{z}_0$. Random noise is added to it to obtain $\boldsymbol{z}_T$, and $\boldsymbol{\hat{z}}_0$ is then estimated from $\boldsymbol{z}_T$ with a diffusion ODE solver. Two optimization terms, involving $\boldsymbol{\hat{z}}_0$, the given measurement $\boldsymbol{y}$, and a random noise term $\boldsymbol{\epsilon}$, are then applied to refine the estimated $\boldsymbol{\hat{z}}_0$ while keeping the solution from getting stuck in local optima. The optimized $\boldsymbol{\hat{z}}_0$ is then re-noised to timestep $T-1$ according to the noise scheduler. The process repeats recursively until $\boldsymbol{\hat{z}}_T$ converges to $\boldsymbol{z}_0$, where $z_0^*$ is the final optimized output.
  • Figure 3: Qualitative Comparison of Reconstruction. It takes 1.5 seconds for our method to reconstruct the input image, and the time is 1.8s, 2s, 15s, and 120s for iCD, DDCM, NPI, and NTI, respectively. Our framework can faithfully reconstruct the foreground object and the background.
  • Figure 4: Qualitative Comparison of Editing. Our method performs better than the others in aligning with target prompts while maintaining the background similarity.
  • Figure 5: Ablation Studies. We show the results without the optimization process shown in Eq. \ref{rec}, the measurement $\boldsymbol{y}$ defined in Eq. \ref{sample_2}, and $\boldsymbol{z}_{in}$ shown in Proposition 1.
  • ...and 14 more figures
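
The loop described in the Figure 2 caption (add noise, estimate a clean latent, refine it against the measurement with noisy gradient steps, re-noise to the next timestep) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: latents are short Python lists, the measurement operator is the identity (so $\boldsymbol{y}$ equals the input latent), `toy_estimate_z0` is a hypothetical stand-in for the prompt-conditioned ODE solver, and the parameters `eta`, `r`, `sigma_y`, and `noise_scale` are names introduced here.

```python
import math
import random


def add_noise(z0, alpha_bar, rng):
    """Forward-diffuse a clean latent to the noise level set by alpha_bar."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in z0]


def toy_estimate_z0(z_t, target, alpha_bar):
    """HYPOTHETICAL stand-in for the prompt-conditioned ODE solver.

    Leans on the target latent at high noise and on z_t at low noise.
    """
    return [alpha_bar * zt + (1.0 - alpha_bar) * tg
            for zt, tg in zip(z_t, target)]


def langevin_refine(z_hat0, y, steps=200, eta=0.05, r=1.0, sigma_y=1.0,
                    noise_scale=1.0, rng=None):
    """Langevin steps on a Gaussian toy posterior (assumed form).

    Two quadratic terms: stay near the estimate z_hat0 (scale r) and
    near the measurement y (scale sigma_y), plus injected noise.
    """
    if rng is None:
        rng = random.Random(0)
    z = list(z_hat0)
    for _ in range(steps):
        z = [
            v
            - eta * ((v - p) / r**2 + (v - m) / sigma_y**2)
            + noise_scale * math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
            for v, p, m in zip(z, z_hat0, y)
        ]
    return z


def edit_loop(z_in, target, alpha_bars, rng):
    """Toy version of the recursive estimate / refine / re-noise loop.

    alpha_bars runs from noisy (small) to clean (close to 1).
    """
    y = list(z_in)  # ASSUMPTION: identity measurement of the input latent
    z_t = add_noise(z_in, alpha_bars[0], rng)
    z_hat0 = list(z_in)
    for i, alpha_bar in enumerate(alpha_bars):
        z_hat0 = toy_estimate_z0(z_t, target, alpha_bar)
        z_hat0 = langevin_refine(z_hat0, y, rng=rng)
        if i + 1 < len(alpha_bars):
            # Re-noise the refined estimate to the next, less noisy level.
            z_t = add_noise(z_hat0, alpha_bars[i + 1], rng)
    return z_hat0
```

Setting `noise_scale=0.0` turns the refinement into plain gradient descent, which settles at the precision-weighted mean of the estimate and the measurement. The injected noise is what lets the iterates explore rather than settle in the nearest optimum, which is the role the caption attributes to $\boldsymbol{\epsilon}$.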

Theorems & Definitions (3)

  • Proposition 1
  • Remark 1
  • Proposition 2