Object-aware Inversion and Reassembly for Image Editing

Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bohan Zhuang, Chunhua Shen

TL;DR

The paper addresses a weakness of diffusion-based, text-driven image editing: using the same fixed number of inversion steps for every editing pair gives suboptimal results. It introduces Object-aware Inversion and Reassembly (OIR), which searches for the optimal inversion step of each editing pair with a CLIP-guided metric that balances editability against fidelity of the non-editing region, and uses a disassembly/reassembly pipeline to edit objects independently before a final fusion. Two new datasets, collectively called OIRBench, benchmark single- and multi-object editing respectively. Experiments show that OIR delivers superior multi-object edits with strong perceptual alignment while remaining competitive on single-object tasks. The approach is training-free and emphasizes object-level control, though it incurs extra inference time and leaves efficiency and broader editing domains such as video to future work.

Abstract

By comparing the original and target prompts, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces sub-optimal generation quality, especially when handling multiple editing pairs in a natural image. To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion steps for each editing pair, by jointly considering the editability of the target and the fidelity of the non-editing region. We use our search metric to find the optimal inversion step for each editing pair when editing an image. We then edit these editing pairs separately to avoid concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets called OIRBench for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
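The search described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the editability score $S_e$ and the non-editing fidelity score $S_{ne}$ have already been computed for every candidate inversion step and are scaled so that higher is better, and it combines them by a plain average as stated in the Figure 2 caption. The function names `combined_score` and `search_optimal_inversion_step` are hypothetical.

```python
from typing import Sequence, Tuple


def combined_score(s_e: float, s_ne: float) -> float:
    """Average the editability score and the non-editing fidelity score."""
    return 0.5 * (s_e + s_ne)


def search_optimal_inversion_step(
    steps: Sequence[int],
    s_e_scores: Sequence[float],
    s_ne_scores: Sequence[float],
) -> Tuple[int, float]:
    """Return the candidate inversion step with the highest combined score.

    Assumption: the scores are precomputed per candidate step and are
    comparable (higher is better for both). How they are computed and
    scaled is outside the scope of this sketch.
    """
    if not steps:
        raise ValueError("at least one candidate step is required")
    if not (len(steps) == len(s_e_scores) == len(s_ne_scores)):
        raise ValueError("steps and score lists must have the same length")

    # Pick the index whose averaged score is largest.
    best_index = max(
        range(len(steps)),
        key=lambda i: combined_score(s_e_scores[i], s_ne_scores[i]),
    )
    best_score = combined_score(s_e_scores[best_index], s_ne_scores[best_index])
    return steps[best_index], best_score


# Toy example for one editing pair: editability rises with more inversion
# steps while fidelity of the non-editing region falls.
candidate_steps = [10, 20, 30, 40]
editability = [0.1, 0.5, 0.9, 1.0]
fidelity = [1.0, 0.9, 0.6, 0.1]
best_step, best_score = search_optimal_inversion_step(
    candidate_steps, editability, fidelity
)
```

Running the search once per editing pair gives each pair its own inversion step, which is the point of the method: an easy edit can stop early, while a harder edit can invert further.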

Paper Structure (23 sections, 3 equations, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Motivation. In text-driven image editing, we first invert the original image to progressively obtain all latents. Then, we denoise each latent to generate images under the guidance of the target prompt. After obtaining all the images, the best edited results are selected by a human. From the first and second rows, we note that different editing pairs have different optimal inversion steps. Moreover, we observe that editing different editing pairs with the same inversion step results in concept mismatch or poor editing, as shown in the third row.
  • Figure 2: Overview of the optimal inversion step search pipeline. (a) For an editing pair, we obtain the candidate images by denoising each inverted latent. (b) We use a mask generator to jointly compute the metrics $S_{e}$ and $S_{ne}$, and finally we obtain $S$ by computing their average.
  • Figure 3: Overview of object-aware inversion and reassembly. (a) We create guided prompts for all editing pairs using $P_o$ and $P_t$. (b) For each editing pair, we use the optimal inversion step search pipeline to automatically find the optimal inversion step. (c) Starting from each optimal inversion step, we guide the denoising of each editing pair individually using its guided prompt. At the reassembly step, we crop the denoised latents of the editing regions and splice them with the inverted latent of the non-editing region. Subsequently, we apply a re-inversion process to the reassembled latent and denoise it guided by $P_t$.
  • Figure 4: Qualitative comparisons. From top to bottom: original image, our method (OIR), PNP [tumanyan2023plug], Stable Diffusion Inpainting, DiffEdit [couairon2022diffedit], Null-text Inversion [mokady2023null]. The text above each image gives its editing pairs.
  • Figure 5: User study results. Users are asked to select the best results in terms of the alignment to target prompts and detail preservation of the input image.
  • ...and 17 more figures
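
The splice described in the Figure 3 caption, where the denoised latents of the editing regions are combined with the inverted latent of the non-editing region, can be sketched as a masked copy. This is an illustration under simplifying assumptions, not the authors' code: latents are modeled as single-channel 2D grids of floats, all latents are assumed to be at the same noise level (the reassembly step), masks are binary and assumed not to overlap, and the function name `reassemble_latents` is hypothetical. The subsequent re-inversion and the final denoising guided by $P_t$ are not modeled.

```python
from typing import List, Sequence

Grid = List[List[float]]


def reassemble_latents(
    non_editing_latent: Grid,
    edited_latents: Sequence[Grid],
    masks: Sequence[Sequence[Sequence[int]]],
) -> Grid:
    """Splice each edited latent into the non-editing latent using its mask.

    Assumptions: every grid and mask has the same height and width, and
    masks hold 1 inside the editing region and 0 elsewhere. If masks
    overlap, the later editing pair overwrites the earlier one.
    """
    if len(edited_latents) != len(masks):
        raise ValueError("need exactly one mask per edited latent")

    # Start from a copy so the input latent is left untouched.
    result = [row[:] for row in non_editing_latent]

    for latent, mask in zip(edited_latents, masks):
        for r, mask_row in enumerate(mask):
            for c, inside in enumerate(mask_row):
                if inside:
                    # Inside the editing region: take the edited latent.
                    result[r][c] = latent[r][c]
    return result


# Toy 2x3 example with two editing pairs.
background = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
edited_a = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
edited_b = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
mask_a = [[1, 0, 0], [1, 0, 0]]
mask_b = [[0, 0, 1], [0, 0, 0]]

reassembled = reassemble_latents(
    background, [edited_a, edited_b], [mask_a, mask_b]
)
```

Positions covered by no mask keep the values of the non-editing latent, which is how the non-editing region stays faithful to the input image in this sketch.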