High-Fidelity Diffusion-based Image Editing

Chen Hou, Guoqiang Wei, Zhibo Chen

TL;DR

The paper tackles the fidelity gap in diffusion-based image editing: using more denoising steps improves reconstruction but not editing, because errors propagate through the conditional Markovian editing process. It introduces a rectifier hypernetwork that modulates the diffusion model's weights using residual features to bridge the fidelity gap, and an editing training paradigm, similar to denoising score matching, that curbs deviations in the denoising trajectory. A directional CLIP loss with an $L_1$ regularizer guides edits, enabling faithful, high-fidelity edits across varying numbers of denoising steps without retraining the base diffusion model. Experimentally, the approach yields superior reconstruction and editing quality, robust out-of-domain generalization, and effective image-to-image translation, underscoring its practical potential for diffusion-based editing tasks.
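
The editing objective mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common form of the directional CLIP loss (one minus the cosine similarity between the image-embedding shift and the text-embedding shift), and it assumes the embeddings were already computed by some CLIP encoder that is not shown. The function names and the `l1_weight` value are hypothetical.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def directional_clip_loss(img_src_emb, img_edit_emb, txt_src_emb, txt_tgt_emb):
    """1 - cos(delta_image, delta_text): the image edit should move in the
    same embedding direction as the change from source to target text."""
    delta_img = img_edit_emb - img_src_emb
    delta_txt = txt_tgt_emb - txt_src_emb
    return 1.0 - cosine(delta_img, delta_txt)

def editing_loss(x_src, x_edit, img_src_emb, img_edit_emb,
                 txt_src_emb, txt_tgt_emb, l1_weight=0.3):
    """Directional loss plus an L1 penalty keeping the edit close to the source.
    l1_weight is a placeholder value, not taken from the paper."""
    l_dir = directional_clip_loss(img_src_emb, img_edit_emb, txt_src_emb, txt_tgt_emb)
    l1 = float(np.mean(np.abs(x_edit - x_src)))
    return l_dir + l1_weight * l1
```

The directional term is near 0 when the image embedding moves parallel to the text shift and near 2 when it moves in the opposite direction; the $L_1$ term penalizes pixel-level drift from the source image.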

Abstract

Diffusion models have attained remarkable success in the domains of image generation and editing. It is widely recognized that using more inversion and denoising steps in diffusion models leads to improved image reconstruction quality. However, the editing performance of diffusion models tends not to improve even as the number of denoising steps increases. The deficiency in editing could be attributed to the conditional Markovian property of the editing process, where errors accumulate throughout the denoising steps. To tackle this challenge, we first propose an innovative framework in which a rectifier module is incorporated to modulate diffusion model weights with residual features, thereby providing compensatory information to bridge the fidelity gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score matching. Extensive experiments demonstrate that our proposed framework and training strategy achieve high-fidelity reconstruction and editing results across various numbers of denoising steps, while performing strongly in both quantitative metrics and qualitative assessments. Moreover, we explore our model's generalization through several applications such as image-to-image translation and out-of-domain image editing.
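
The error-accumulation argument can be illustrated with a toy numeric sketch. Everything here is an assumption made for illustration: the linear "denoising step", the constant per-step error, and all function names are hypothetical stand-ins, not the paper's model or training code. The sketch only contrasts chaining each step on the previous edited output (Markovian) with restarting each step from the stored original trajectory.

```python
def reference_step(x):
    """Toy stand-in for one denoising step of the frozen model."""
    return 0.9 * x

def edited_step(x, step_error=0.05):
    """Toy stand-in for one step of the edited model: same map plus a small error."""
    return 0.9 * x + step_error

def original_trajectory(x_T, num_steps):
    """Latents visited by the frozen model, starting from x_T."""
    traj = [x_T]
    for _ in range(num_steps):
        traj.append(reference_step(traj[-1]))
    return traj

def markovian_deviation(x_T, num_steps):
    """Each edited step consumes the previous edited output, so errors compound."""
    traj = original_trajectory(x_T, num_steps)
    x, devs = x_T, []
    for t in range(num_steps):
        x = edited_step(x)
        devs.append(abs(x - traj[t + 1]))
    return devs

def anchored_deviation(x_T, num_steps):
    """Each edited step starts from the original trajectory, so errors do not chain."""
    traj = original_trajectory(x_T, num_steps)
    return [abs(edited_step(traj[t]) - traj[t + 1]) for t in range(num_steps)]
```

In this toy setting the Markovian deviation grows with every step, while the anchored deviation stays at the single-step error.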

Paper Structure (25 sections, 6 equations, 18 figures, 2 tables, 1 algorithm)

Figures (18)

  • Figure 1: Reconstruction and editing results under various levels of inversion and denoising steps. While increasing steps makes reconstruction nearly perfect, the outcomes of editing still remain far from satisfactory (attribute: smiling).
  • Figure 2: Overview of our proposed rectifier framework. The rectifier is a hypernetwork consisting of a global encoder and multiple subnet branches. It takes as input the original image $\bm{x}_0$ and the estimate at each step ($\mathbb{P}_t[\bm{\epsilon}_t^\theta(\bm{x}_t)]$), and aims to modulate the degraded residual features into offset weights, providing compensatory information for high-fidelity reconstruction. We select the middle and up-sampling blocks of the U-Net for modulation, since these blocks contain both high-level semantic information and low-level details. We also employ separable convolutions to reduce the number of generated parameters.
  • Figure 3: Editing training strategy. Instead of shifting from the previous edited result in the Markovian style used in DiffusionCLIP (a), which may lead to error propagation, we start from the original trajectory at each step to find the editing direction (b), which further alleviates the error accumulation caused by the editing process.
  • Figure 4: Comparison of reconstruction quality under 50 steps. Our method is more robust to occlusions (1st column), illuminations (2nd column), viewpoints (3rd and 4th columns), and performs better at restoring coarse shapes (5th column) and preserving fine details (6th column).
  • Figure 5: Editing qualitative comparisons. Our method delivers realistic edits while maintaining low distortion and high fidelity.
  • ...and 13 more figures
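
The weight modulation described in the Figure 2 caption can be sketched as below. This is a hedged illustration under two assumptions that the text above does not confirm: that the predicted offsets rescale the base weights as $W' = W \cdot (1 + \Delta)$, and that "separable" means a depthwise/pointwise factorization of the offset tensor. All function names are hypothetical.

```python
import numpy as np

def full_offset_params(c_out, c_in, k):
    """Values a hypernetwork must predict for a dense k x k offset tensor."""
    return c_out * c_in * k * k

def separable_offset_params(c_out, c_in, k):
    """Values to predict under the assumed factorization:
    one k x k depthwise filter per input channel plus a 1 x 1 pointwise mix."""
    return c_in * k * k + c_out * c_in

def separable_offset(depthwise, pointwise):
    """Combine depthwise (c_in, k, k) and pointwise (c_out, c_in) factors
    into a full offset of shape (c_out, c_in, k, k) by broadcasting."""
    return pointwise[:, :, None, None] * depthwise[None, :, :, :]

def modulate(weight, offset):
    """Assumed modulation rule: W' = W * (1 + offset). A zero offset leaves W unchanged."""
    return weight * (1.0 + offset)
```

For a 3 x 3 convolution with 512 input and 512 output channels, the dense offset has 2,359,296 entries, while this factorization needs 266,752, which shows why a separable form reduces the number of generated parameters.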