
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

Runze He, Kai Ma, Linjiang Huang, Shaofei Huang, Jialin Gao, Xiaoming Wei, Jiao Dai, Jizhong Han, Si Liu

TL;DR

This work tackles mask-free, reference-based image editing by introducing FreeEdit, which uses a multi-modal instruction encoder to guide edits with language while incorporating a reference image for identity consistency. A Decoupled Residual Refer-Attention (DRRA) module and a detail extractor enable faithful reconstruction of reference details without perturbing the original self-attention, and two-stage training followed by quality tuning yields high-quality edits. To support this task, the authors create FreeBench via a twice-repainting scheme that ensures identity alignment between edited and reference objects and enables multi-modal editing instructions. Experiments show that FreeEdit outperforms existing mask-free methods and remains competitive with mask-based approaches, while also supporting plain-text editing and mask-free virtual try-on. The work provides a new dataset and a framework for flexible reference-based editing without manual masks.

Abstract

Introducing user-specified visual concepts in image editing is highly practical, as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages a multi-modal instruction encoder to encode language instructions that guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual Refer-Attention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way, without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty of constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, and a reference image that maintains the identity of the edited object, covering tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: https://freeedit.github.io/.
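
The residual integration described for DRRA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, plain matrix projections, and hypothetical names (`self_attention`, `drra_block`, `w_self`, `w_ref`, `w_out`). The zero-initialized output projection is also an assumption, chosen here only to show that a decoupled residual branch can leave the original self-attention output unchanged.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v


def self_attention(x, w_self):
    """Original self-attention over the U-Net tokens x of shape (n, d)."""
    q, k, v = (x @ w for w in w_self)
    return attention(q, k, v)


def drra_block(x, ref, w_self, w_ref, w_out):
    """Self-attention plus a separate reference-attention residual branch.

    x:   (n, d) tokens from the denoising U-Net
    ref: (m, d) fine-grained tokens from the detail extractor (assumed shape)
    """
    self_out = self_attention(x, w_self)        # computed exactly as before
    q_r = x @ w_ref[0]                          # queries come from the U-Net tokens
    k_r = ref @ w_ref[1]                        # keys and values come from the reference
    v_r = ref @ w_ref[2]
    ref_out = attention(q_r, k_r, v_r) @ w_out  # decoupled branch with its own softmax
    return self_out + ref_out                   # residual combination


rng = np.random.default_rng(0)
n, m, d = 6, 4, 8
x = rng.standard_normal((n, d))
ref = rng.standard_normal((m, d))
w_self = [rng.standard_normal((d, d)) for _ in range(3)]
w_ref = [rng.standard_normal((d, d)) for _ in range(3)]

baseline = self_attention(x, w_self)
out_zero = drra_block(x, ref, w_self, w_ref, np.zeros((d, d)))
out_rand = drra_block(x, ref, w_self, w_ref, rng.standard_normal((d, d)))
```

Because the reference branch has its own softmax and is only added to the result, the self-attention weights over the original tokens are never renormalized against reference tokens. With `w_out` set to zeros, the block reduces exactly to the original self-attention.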

Paper Structure (23 sections, 7 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Comparison between the mask-based and mask-free paradigms. The former requires the user to provide a source mask to specify the editing area, while the latter conditions the diffusion model on language instructions and needs no mask. Reference-based inpainting conditions the model on a reference image embedding and therefore no longer supports natural language. We use multi-modal instructions to introduce reference image features while retaining the ability to interpret natural language.
  • Figure 2: The overall pipeline of our proposed FreeEdit, which consists of three components: (a) Multi-modal instruction encoder. (b) Detail extractor. (c) Denoising U-Net. The text instruction and reference image are first fed into the multi-modal instruction encoder to generate the multi-modal instruction embedding. The reference image is additionally fed into the detail extractor to obtain fine-grained features. The original image latent is concatenated with the noise latent to introduce the original image condition. The denoising U-Net accepts the 8-channel input and interacts with the multi-modal instruction embedding through cross-attention. The DRRA modules, which connect the detail extractor and the denoising U-Net, integrate fine-grained features from the detail extractor to promote ID consistency with the reference image. (d) Editing examples obtained using FreeEdit.
  • Figure 3: Comparison between (a) Self-Attention, (b) Refer-Attention in Self-Attention (RASA), and (c) Decoupled Residual Refer-Attention (DRRA). RASA performs additional attention over the reference features obtained from the detail extractor by concatenating them into the original self-attention module. DRRA retains the original self-attention and implements the decoupled reference attention as a residual connection.
  • Figure 4: Pipeline for dataset construction and examples of training samples. (a) Image triplet construction. We repaint each source image from an existing real-world segmentation dataset twice to form the image triplet. (b) Instruction construction. We use multiple powerful MLLMs to caption the generated image, and combine the resulting local descriptions with instruction templates to form edit instructions. (c) Examples from the training dataset. Each item in the dataset contains the images before and after editing and a multi-modal instruction.
  • Figure 5: Statistics for the FreeBench dataset. The first four parent classes in FreeBench are animals, food, kitchenware, and vehicles. FreeBench covers the vast majority of categories in daily life, allowing us to train a generalizable zero-shot reference-based image editing model.
  • ...and 9 more figures
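
The 8-channel U-Net input mentioned in the Figure 2 caption can be sketched as a channel-wise concatenation. This is a minimal illustration under stated assumptions, not the authors' code: it assumes 4-channel latents of equal shape in (batch, channels, height, width) layout, and the concatenation order (noise latent first) is a guess, since the caption does not specify it. The function name `build_unet_input` is hypothetical.

```python
import numpy as np


def build_unet_input(noisy_latent, source_latent):
    """Concatenate the noise latent and the original-image latent along channels.

    Both arrays are assumed to have shape (batch, 4, height, width).
    The order (noise latent first) is an assumption.
    """
    if noisy_latent.shape != source_latent.shape:
        raise ValueError("latents must have identical shapes")
    return np.concatenate([noisy_latent, source_latent], axis=1)


rng = np.random.default_rng(0)
noisy = rng.standard_normal((2, 4, 64, 64))   # noise latent at the current timestep
source = rng.standard_normal((2, 4, 64, 64))  # encoded original image (the condition)

unet_in = build_unet_input(noisy, source)
print(unet_in.shape)  # (2, 8, 64, 64)
```

Conditioning through the input channels lets the U-Net see the original image at every spatial location, while the multi-modal instruction enters separately through cross-attention.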