BrushEdit: All-In-One Image Inpainting and Editing
Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Junhao Zhuang, Ying Shan, Yuexian Zou, Qiang Xu
TL;DR
BrushEdit addresses the challenge of flexible, instruction-guided image editing by unifying multimodal language understanding with a versatile, all-in-one inpainting backbone. It introduces an Editing Instructor (MLLM-based) and an Editing Conductor (a dual-branch, BrushNet-based diffusion inpainting model) in an agent-cooperative loop, enabling free-form, multi-turn edits without task-specific retraining. Trained on the expanded BrushData-v2 and evaluated on PIE-Bench, BrushBench, and EditBench, BrushEdit demonstrates superior background preservation and text alignment across editing and inpainting tasks, while offering plug-and-play compatibility with multiple diffusion backbones. The work also notes practical benefits for content creators, limitations tied to base-model quality and irregular masks, and responsible-use considerations regarding potential societal impact.
Abstract
Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with large modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we integrate MLLMs and a dual-branch image inpainting model in an agent-cooperative framework that performs editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.
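
The four stages named in the abstract (editing category classification, main object identification, mask acquisition, and editing area inpainting) can be pictured with a toy sketch. This is an illustration only, not the BrushEdit implementation: every function name here is hypothetical, simple string rules stand in for the MLLM, a hand-supplied box stands in for mask acquisition, and a constant fill stands in for the diffusion inpainting model. It is meant only to show the control flow and the fact that pixels outside the mask are left untouched.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]   # toy grayscale image as nested lists
Mask = List[List[bool]]   # True marks pixels to be regenerated


@dataclass
class EditPlan:
    category: str  # "add", "remove", or "edit"
    target: str    # main object named in the instruction


def classify_edit(instruction: str) -> str:
    """Hypothetical stand-in for MLLM editing-category classification."""
    text = instruction.lower()
    if text.startswith("add"):
        return "add"
    if text.startswith("remove"):
        return "remove"
    return "edit"


def identify_target(instruction: str) -> str:
    """Hypothetical stand-in for main-object identification (last word)."""
    return instruction.rstrip(".").split()[-1]


def acquire_mask(image: Image, box: Tuple[int, int, int, int]) -> Mask:
    """Build a rectangular mask; a real system would derive it from the target."""
    top, left, bottom, right = box
    return [
        [top <= r < bottom and left <= c < right for c in range(len(row))]
        for r, row in enumerate(image)
    ]


def inpaint(image: Image, mask: Mask, fill: int) -> Image:
    """Placeholder inpainting: replace masked pixels, keep the background."""
    return [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


def run_edit(image: Image, instruction: str,
             box: Tuple[int, int, int, int], fill: int):
    """Chain the four stages in order and return the plan and edited image."""
    plan = EditPlan(classify_edit(instruction), identify_target(instruction))
    mask = acquire_mask(image, box)
    return plan, inpaint(image, mask, fill)


if __name__ == "__main__":
    blank = [[0] * 4 for _ in range(4)]
    plan, edited = run_edit(blank, "Remove the dog", (1, 1, 3, 3), fill=9)
    print(plan)
    for row in edited:
        print(row)
```

Because the mask is an explicit intermediate value, a user could inspect or redraw it before the inpainting step runs, which is the kind of direct control over the editing region that the abstract contrasts with black-box instruction-based methods.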
