Edicho: Consistent Image Editing in the Wild

Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen

TL;DR

Edicho tackles inconsistent cross-image edits in uncontrolled real-world images by introducing explicit correspondence into diffusion-based editing. It combines Corr-Attention and Corr-CFG to steer denoising with pre-estimated image correspondences, enabling training-free, plug-and-play edits that generalize across images and editing tasks. Quantitative and qualitative results show better text alignment and editing consistency than strong baselines, along with practical applications in customization and 3D reconstruction. The approach preserves pre-trained generative priors and performs robustly in diverse, in-the-wild scenarios. Its limitations mainly arise from correspondence misalignment and occasional texture distortions, which better correspondence extractors may mitigate.

Abstract

Consistent editing across in-the-wild images is a verified need, yet it remains a technical challenge arising from various unmanageable factors, such as object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, built on the design principle of using explicit image correspondence to direct editing. Specifically, the key components are an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take the pre-estimated correspondence into account. This inference-time algorithm is plug-and-play and compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.
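
The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' Corr-Attention or Corr-CFG: it only shows the standard classifier-free guidance combination, plus a generic way to borrow features from a source image through a pre-estimated correspondence map. The function names (`cfg_combine`, `warp_by_correspondence`, `blend_with_source`), the flattened `(N, C)` feature layout, the `valid` mask, and the `weight` parameter are all assumptions made for illustration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: move from the unconditional
    noise prediction toward the conditional one by guidance_scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def warp_by_correspondence(src_feats, corr_idx):
    """Gather, for every target location, the feature of its matched
    source location.

    src_feats: (N_src, C) flattened source features.
    corr_idx:  (N_tgt,) integer indices into src_feats (assumed to come
               from a correspondence extractor that is not shown here).
    """
    return src_feats[corr_idx]

def blend_with_source(tgt_feats, src_feats, corr_idx, valid, weight=0.5):
    """Hypothetical blend: where a match is marked valid, mix the target
    feature with the warped source feature; elsewhere keep the target."""
    warped = warp_by_correspondence(src_feats, corr_idx)
    out = tgt_feats.copy()
    out[valid] = (1.0 - weight) * tgt_feats[valid] + weight * warped[valid]
    return out

# Toy usage: 3 source locations, 2 target locations, 2 channels.
src = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
tgt = np.zeros((2, 2))
corr = np.array([2, 0])            # target 0 matches source 2, target 1 matches source 0
valid = np.array([True, False])    # only the first match is trusted
blended = blend_with_source(tgt, src, corr, valid, weight=0.5)
guided = cfg_combine(np.zeros(2), np.ones(2), guidance_scale=7.5)
```

The same gather-and-blend pattern could in principle be applied to attention features or to noisy latents; how Edicho actually injects correspondence at each level is defined in the paper, not by this sketch.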

Paper Structure (19 sections, 8 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: Given two images in the wild, Edicho generates consistently edited versions of them in a zero-shot manner. Our approach achieves precise consistency when editing parts (left), objects (middle), and entire images (right) by leveraging explicit correspondence.
  • Figure 2: Comparison of implicit correspondence prediction and our explicit correspondence prediction for images in the wild. The implicit correspondence computed from cross-image attention is less accurate and is unstable across denoising steps and network layers.
  • Figure 3: Framework of Edicho. To achieve consistent editing, we first predict the explicit correspondence for the input images with extractors. The pre-computed correspondence is injected into the pre-trained diffusion models and guides the denoising at two levels: (a) attention features and (b) noisy latents in CFG.
  • Figure 4: Qualitative comparisons on local editing with Adobe Firefly (AF) [Firefly], Anydoor (AD) [chen2024anydoor], and Paint-by-Example (PBE) [yang2023paint]. The inpainted areas of the inputs are highlighted in red.
  • Figure 5: Qualitative comparisons on global editing with MasaCtrl (MC) [cao2023masactrl], StyleAligned (SA) [hertz2024stylealign], and Cross-Image-Attention (CIA) [alaluf2024cross].
  • ...and 9 more figures