Zero-shot Image Editing with Reference Imitation

Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao

TL;DR

The paper tackles local image editing guided by a reference image, without requiring an explicit mask on the reference. It introduces MimicBrush, a framework built on dual diffusion U-Nets that learns cross-image semantic correspondence by training on pairs of video frames and injects reference features into the editing network, so that masked regions are filled in harmony with the background. Trained with a self-supervised pipeline and evaluated on a dedicated benchmark covering Part Composition and Texture Transfer tasks, it shows better fidelity and blending than existing alternatives across diverse domains, supported by ablations and qualitative analyses. The work enables intuitive, cross-domain, region-level edits and provides a foundation for future work on reference-driven image editing without heavy annotation or fine-tuning.

Abstract

Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe what the edited image should look like. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., related pictures encountered online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.
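
The self-supervised data construction described in the abstract (pick two frames from a clip, mask regions of one, recover them using the other) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `sample_training_pair`, the grid-cell masking strategy, and the `grid` and `mask_ratio` parameters are all assumptions made for the example.

```python
import random


def sample_training_pair(frames, grid=4, mask_ratio=0.5, rng=None):
    """Pick two distinct frames and mask grid cells of the source frame.

    frames: list of equally sized 2D frames (each a list of rows).
    Returns (masked_source, mask, reference, target), where target is the
    unmasked source frame the model would learn to recover.
    """
    rng = rng or random.Random()
    src_idx, ref_idx = rng.sample(range(len(frames)), 2)
    source, reference = frames[src_idx], frames[ref_idx]
    height, width = len(source), len(source[0])

    # Hypothetical masking strategy: hide a random subset of grid cells.
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    hidden = set(rng.sample(cells, max(1, int(mask_ratio * len(cells)))))

    # mask[y][x] == 1 marks a pixel the model must recover.
    mask = [[1 if (y * grid // height, x * grid // width) in hidden else 0
             for x in range(width)] for y in range(height)]
    masked_source = [[0 if mask[y][x] else source[y][x]
                      for x in range(width)] for y in range(height)]
    return masked_source, mask, reference, source
```

Because both frames come from the same clip, the reference shows the same content in a different pose or view, so the only way to fill the hidden cells faithfully is to locate the corresponding content in the reference.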

Paper Structure

This paper contains 13 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Diverse editing results produced by MimicBrush, where users only need to specify the to-edit regions in the source image (i.e., white masks) and provide an in-the-wild reference image illustrating how the regions are expected to look after editing. Our model automatically captures the semantic correspondence between them, and accomplishes the editing with a feedforward network execution.
  • Figure 2: Conceptual comparisons of different pipelines. To edit a local region, besides taking the source image and source mask (which indicates the to-edit region), inpainting models use text prompts to guide the generation. Image composition methods take a reference image along with a mask/box to crop out the specific reference region. In contrast, our pipeline simply takes a reference image, and the reference regions are automatically discovered by the model itself.
  • Figure 3: The training process of MimicBrush. First, we randomly sample two frames from a video sequence as the reference and source images. The source image is then masked and augmented. Afterward, we feed the noisy image latent, mask, background latent, and depth latent of the source image into the imitative U-Net. The reference image is also augmented and sent to the reference U-Net. The dual U-Nets are trained to recover the masked area of the source image. The attention keys and values of the reference U-Net are concatenated with those of the imitative U-Net to assist the synthesis of the masked regions.
  • Figure 4: Sample illustration of our benchmark. It covers the tasks of part composition (first row) and texture transfer (second row). Each task includes an inter-ID and an inner-ID track. The annotated data and evaluation metrics for each track are illustrated beside the exemplar images.
  • Figure 5: Qualitative comparisons. Note that the other methods require additional inputs: Firefly takes detailed text prompts, and we mark the specific reference regions with boxes and masks for Paint-by-Example and AnyDoor. Even so, MimicBrush still demonstrates prominent advantages in both fidelity and harmony.
  • ...and 2 more figures
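
The key/value concatenation mentioned in the Figure 3 caption can be illustrated with a small sketch. It assumes single-head scaled dot-product attention over plain Python lists, with no learned projections; the function names (`attention`, `reference_injected_attention`) are hypothetical, and the real model applies this inside the attention layers of the two U-Nets rather than on raw vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def reference_injected_attention(q_src, k_src, v_src, k_ref, v_ref):
    """Source queries attend jointly over source and reference tokens.

    The reference keys and values are appended to the source ones, so
    each source token can pull features from either image.
    """
    return attention(q_src, k_src + k_ref, v_src + v_ref)
```

In this form the output keeps one vector per source query, so the reference contributes content without changing the spatial layout of the source features.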