MagicQuill: An Intelligent Interactive Image Editing System

Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen

TL;DR

MagicQuill tackles the challenge of precise, interactive image editing by combining three components: a dual-branch, diffusion-based Editing Processor, a Painting Assistor that predicts the user's painting intent, and a cross-platform Idea Collector UI. The Editing Processor provides edge- and color-guided control, while the Painting Assistor, an MLLM trained on the Draw&Guess task, predicts contextually appropriate prompts to minimize manual input. Key contributions include a dedicated Draw&Guess dataset used to fine-tune MLLMs with LoRA, a plug-and-play editing toolkit compatible with multiple Stable Diffusion (SD) weights, and user studies reporting improved precision, efficiency, and usability over baselines. The work reports generalization across fine-tuned diffusion models and supports the practicality of an open-source, interactive editing framework for creative workflows.

Abstract

Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit https://magic-quill.github.io to try out our system.

Paper Structure

This paper contains 27 sections, 7 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: System framework consisting of three integrated components: an Editing Processor with dual-branch architecture for controllable image inpainting, a Painting Assistor for real-time intent prediction, and an Idea Collector offering versatile brush tools. This design enables intuitive and precise image editing through brushstroke-based interactions.
  • Figure 2: Data processing pipeline. The input image undergoes edge extraction via CNN and color simplification through downscaling. Three editing conditions are then generated based on brush signals: editing mask, edge condition, and color condition, which together provide control for image editing.
  • Figure 3: Overview of our Editing Processor. The proposed architecture extends the latent diffusion UNet with two specialized branches: an inpainting branch for content-aware per-pixel inpainting guidance and a control branch for structural guidance, enabling precise brush-based image editing.
  • Figure 4: Illustration of dataset construction process. (a) Original images from the DCI dataset; (b) Edge maps extracted from original images; (c) Selected masks (highlighted in purple) with highest edge density; (d) Results after BrushNet inpainting on augmented masked regions; (e) Final results with edge map overlay on selected areas. By overlaying edge maps on inpainted results, we simulate scenarios where users edit images with brush strokes, as the edge maps resemble hand-drawn sketches. The bounding box coordinates of the mask and labels are inherited from the DCI dataset.
  • Figure 5: Visual result comparison. The first two columns present the edge and color conditions for editing, while the last column shows the ground truth image that the models aim to recreate. SmartEdit utilizes natural language for guidance, but lacks precision in controlling shape and color, often affecting non-target regions. SketchEdit, a GAN-based approach, struggles with open-domain image generation, falling short compared to models with diffusion-based generative priors. Although BrushNet delivers seamless image inpainting, it struggles to align edges and colors simultaneously, even with ControlNet enhancement. In contrast, our Editing Processor strictly adheres to both edge and color conditions, achieving high-fidelity conditional image editing.
  • ...and 16 more figures
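
The data processing pipeline described in the Figure 2 caption (color simplification through downscaling, then an editing mask, an edge condition, and a color condition derived from brush signals) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the three stroke inputs (add_stroke, sub_stroke, color_stroke), the block size, and the way strokes are combined with the edge map are all assumptions made for illustration, and the edge map is taken as given rather than extracted by a CNN.

```python
import numpy as np


def simplify_color(image: np.ndarray, factor: int = 16) -> np.ndarray:
    """Flatten colors by block-average downscaling, then nearest-neighbour upscaling.

    Assumption: `image` is an (H, W, C) array whose H and W are multiples of `factor`.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image height and width must be multiples of factor")
    # Group pixels into factor x factor blocks and average each block.
    blocks = image.astype(np.float32).reshape(h // factor, factor, w // factor, factor, c)
    small = blocks.mean(axis=(1, 3))
    # Repeat each averaged pixel to restore the original resolution.
    restored = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return restored.astype(image.dtype)


def build_conditions(image, edge_map, add_stroke, sub_stroke, color_stroke,
                     stroke_rgb, factor=16):
    """Return (editing mask, edge condition, color condition).

    Hypothetical inputs: `edge_map` and the three stroke arrays are boolean
    (H, W) arrays; `stroke_rgb` is the color picked for the color stroke.
    """
    # Edge condition: existing edges plus added strokes, minus erased strokes.
    edge_cond = (edge_map | add_stroke) & ~sub_stroke
    # Color condition: simplified image with the color stroke painted on top.
    color_cond = simplify_color(image, factor)
    color_cond[color_stroke] = stroke_rgb
    # Editing mask: every pixel touched by any stroke.
    mask = add_stroke | sub_stroke | color_stroke
    return mask, edge_cond, color_cond


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    edges = np.zeros((64, 64), dtype=bool)
    edges[32, :] = True                      # one horizontal edge
    add = np.zeros_like(edges); add[:, 10] = True
    sub = np.zeros_like(edges); sub[32, 40:] = True
    col = np.zeros_like(edges); col[:8, :8] = True
    mask, edge_cond, color_cond = build_conditions(img, edges, add, sub, col, (255, 0, 0))
    print(mask.sum(), edge_cond.sum(), color_cond.shape)
```

In this sketch the editing mask is simply the union of the strokes. A real system would likely dilate or otherwise expand that region before inpainting; that step is omitted here because the caption does not specify it.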