BrushEdit: All-In-One Image Inpainting and Editing

Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Junhao Zhuang, Ying Shan, Yuexian Zou, Qiang Xu

TL;DR

BrushEdit addresses the challenge of flexible, instruction-guided image editing by unifying multimodal language understanding with a versatile, all-in-one inpainting backbone. It introduces an Editing Instructor (MLLM-based) and an Editing Conductor (dual-branch BrushNet-in-diffusion) in an agent-cooperative loop, enabling free-form, multi-turn edits without task-specific retraining. Trained on the expanded BrushData-v2 and evaluated on PIE-Bench, BrushBench, and EditBench, BrushEdit demonstrates superior background preservation and text alignment across editing and inpainting tasks, while offering plug-and-play compatibility with multiple diffusion backbones. The work highlights practical benefits for content creators and realistic limitations related to base-model quality and mask irregularities, along with responsible-use considerations for potential societal impact.

Abstract

Image editing has advanced significantly with the development of diffusion models, using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with major modifications (e.g., adding or removing objects) because the structured nature of inversion noise hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system that integrates MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics, including mask region preservation and editing effect coherence.

Paper Structure

This paper contains 30 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: BrushEdit is a cutting-edge interactive image editing framework that combines language models and inpainting techniques for seamless edits. Leveraging pre-trained multimodal language models and BrushNet's dual-branch architecture, users can achieve diverse edits such as adding objects, removing elements, or making structural changes with free-form masks.
  • Figure 2: BrushEdit can achieve all-in-one inpainting for arbitrary mask shapes without requiring separate model training for each mask type. This flexibility in handling arbitrary shapes also enhances user-driven editing, as user-provided masks often combine segmentation-based structural details with random mask noise. By supporting arbitrary mask shapes, BrushEdit avoids both the artifacts introduced by the random-mask version, BrushNet-Ran, and the edge inconsistencies caused by the strong reliance on boundary shapes of the segmentation-mask version, BrushNet-Seg.
  • Figure 3: Model overview. Our model outputs an inpainted image given the mask and masked image as input. First, we downsample the mask to match the size of the latent, and feed the masked image to the VAE encoder to align it with the distribution of the latent space. Then, the noisy latent, masked image latent, and downsampled mask are concatenated as the input of BrushEdit. The features extracted from BrushEdit are added to the pretrained UNet layer by layer after a zero convolution block [zhang2023adding]. After denoising, the generated image and masked image are blended with a blurred mask.
  • Figure 4: Benchmark overview. I and II show natural and artificial images, respectively, with masks and captions from BrushBench. (a) to (d) show images of humans, animals, indoor scenarios, and outdoor scenarios. Each group of images shows the original image, inside-inpainting mask, and outside-inpainting mask, with an image caption on top. III shows images, masks, and captions from EditBench [wang2023imagen], with (e) for generated images and (f) for natural images. The images are randomly selected from both benchmarks.
  • Figure 5: Comparison of previous editing methods and BrushEdit on natural and synthetic images, covering image editing operations such as removing objects (I), adding objects (II), modifying attributes (III), and swapping objects (IV).
  • ...and 3 more figures
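
The data flow described in the Figure 3 caption (downsample the mask, concatenate it with the latents, then blend with a blurred mask) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: nested lists stand in for tensors, and the downsampling factor of 8, the 4 latent channels, the box blur, the convention that 1 marks the region to inpaint, and all function names are assumptions made for the example. The VAE, the UNet, and the zero convolution are omitted entirely.

```python
# Illustrative sketch only; all names and constants here are hypothetical.

def downsample_mask(mask, factor=8):
    """Nearest-neighbour downsample of a 2D mask (assumed: 1 = region to inpaint)."""
    return [row[::factor] for row in mask[::factor]]

def concat_channels(noisy_latent, masked_latent, small_mask):
    """Channel-wise concatenation of two latents (lists of 2D channels) and one mask channel."""
    return noisy_latent + masked_latent + [small_mask]

def box_blur(mask, radius=1):
    """Box blur used as a stand-in for the blur, which the caption does not specify."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [mask[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def blend(generated, original, soft_mask):
    """Per-pixel blend: soft_mask * generated + (1 - soft_mask) * original."""
    return [[m * g + (1.0 - m) * o for g, o, m in zip(g_row, o_row, m_row)]
            for g_row, o_row, m_row in zip(generated, original, soft_mask)]

# Toy 16x16 single-channel example with a square mask in the middle.
H = W = 16
mask = [[1 if 4 <= x < 12 and 4 <= y < 12 else 0 for x in range(W)] for y in range(H)]
small_mask = downsample_mask(mask, factor=8)          # 2x2 mask at latent resolution

# Placeholder latents: 4 channels each (an assumption), filled with zeros.
noisy_latent = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]
masked_latent = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]
branch_input = concat_channels(noisy_latent, masked_latent, small_mask)

# Placeholder images: "generated" is all ones, "original" is all zeros.
generated = [[1.0] * W for _ in range(H)]
original = [[0.0] * W for _ in range(H)]
blended = blend(generated, original, box_blur(mask))
```

In this toy setup the blended result keeps the original pixels far from the mask, uses the generated pixels deep inside it, and mixes the two near the mask boundary, which is the purpose of blurring the mask before blending.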