Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Ziyun Zeng, David Junhao Zhang, Wei Li, Mike Zheng Shou

TL;DR

Draw-In-Mind (DIM) is a dataset with two complementary subsets: DIM-T2I, containing 14M long-context image-text pairs that enhance complex instruction comprehension, and DIM-Edit, consisting of 233K chain-of-thought imaginations that serve as explicit design blueprints for image edits. Models trained on DIM demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing.

Abstract

In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models are available at https://github.com/showlab/DIM.
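The connector described in the abstract, a two-layer MLP that maps the frozen MLLM's hidden states into conditions for the trainable DiT, can be illustrated with a minimal dependency-free sketch. This is not the authors' implementation: the class name `TwoLayerConnector`, the toy dimensions, the GELU activation, and the initialization scheme are all assumptions made for illustration, since the abstract only specifies "a lightweight two-layer MLP".

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the error function (activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Return (weights, bias); weights has shape out_dim x in_dim."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply a dense layer to a single token vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class TwoLayerConnector:
    """Hypothetical stand-in for the MLP bridging the MLLM and the DiT."""

    def __init__(self, mllm_dim: int, hidden_dim: int, dit_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.fc1 = init_linear(mllm_dim, hidden_dim, rng)
        self.fc2 = init_linear(hidden_dim, dit_dim, rng)

    def __call__(self, tokens):
        """Project each MLLM hidden state to the DiT's conditioning width."""
        projected = []
        for tok in tokens:
            hidden = [gelu(v) for v in linear(tok, *self.fc1)]
            projected.append(linear(hidden, *self.fc2))
        return projected

    def num_parameters(self) -> int:
        """Count weights and biases of both layers."""
        return sum(len(w) * len(w[0]) + len(b) for w, b in (self.fc1, self.fc2))


# Toy sizes only; the real hidden widths of the two backbones are not given here.
connector = TwoLayerConnector(mllm_dim=8, hidden_dim=16, dit_dim=6)
mllm_states = [[0.1 * (i + j) for j in range(8)] for i in range(5)]  # 5 tokens
conditions = connector(mllm_states)
print(len(conditions), len(conditions[0]), connector.num_parameters())
```

In a setup like the one the abstract describes, only the connector and the generation module would receive gradient updates, while the understanding module's weights stay fixed; the sketch above covers the forward projection only.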

Paper Structure

This paper contains 24 sections, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Upper: We employ a lightweight MLP connector to bridge a frozen MLLM, i.e., Qwen2.5-VL-3B, with a trainable DiT, i.e., SANA1.5-1.6B, forming DIM-4.6B-Edit. In the editing process, we first leverage an external designer to produce a textual blueprint in a chain-of-thought style, which is then provided to DIM-4.6B-Edit to carry out precise image editing. Lower: DIM-4.6B-Edit establishes new state-of-the-art results on the challenging ImgEdit benchmark across diverse designers, while requiring $5\times$ fewer parameters than existing frontier models. These results highlight both the effectiveness of the proposed DIM dataset and the generalizability of our approach.
  • Figure 2: The creation pipeline of DIM-Edit begins with a quality assessment of existing image editing data, followed by prompt optimization using GPT-4o. Finally, the optimized prompts together with the corresponding image pairs are fed into GPT-4o, which generates a four-step chain-of-thought imagination in the textual space.
  • Figure 3: Green and Blue: the edits of Janus-4o and Step1X-Edit; Red: the edits of our models trained on different data corpora. All variants are tuned from the base checkpoint ❀ in Table \ref{tab:ab_data}.
  • Figure 4: The edits of Janus-4o, Step1X-Edit, and DIM-4.6B-Edit for the add operation.
  • Figure 5: The edits of Janus-4o, Step1X-Edit, and DIM-4.6B-Edit for the change operation.
  • ...and 15 more figures
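
The DIM-Edit creation pipeline summarized in the Figure 2 caption (quality assessment, prompt optimization, then four-step chain-of-thought imagination) can be sketched as a simple filter-and-annotate loop. Everything named here is hypothetical: the dataclasses, the `build_blueprint_corpus` function, the numeric quality threshold, and the three callables that stand in for the quality assessor and the two GPT-4o stages. No real API is called, and the content of the four steps is left as placeholders because the caption does not spell it out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EditSample:
    source_image: str   # path or ID of the original image
    edited_image: str   # path or ID of the edited image
    prompt: str         # original editing instruction


@dataclass
class BlueprintSample:
    sample: EditSample
    optimized_prompt: str
    cot_steps: List[str]


def build_blueprint_corpus(
    samples: Iterable[EditSample],
    assess_quality: Callable[[EditSample], float],
    optimize_prompt: Callable[[str], str],
    imagine: Callable[[str, str, str], List[str]],
    min_quality: float = 0.5,   # assumed threshold, not from the paper
    num_steps: int = 4,         # the caption mentions a four-step imagination
) -> List[BlueprintSample]:
    """Filter editing pairs, rewrite their prompts, and attach a blueprint."""
    corpus = []
    for sample in samples:
        # Stage 1: drop low-quality editing pairs.
        if assess_quality(sample) < min_quality:
            continue
        # Stage 2: rewrite the instruction.
        prompt = optimize_prompt(sample.prompt)
        # Stage 3: generate the textual imagination from prompt + image pair.
        steps = imagine(sample.source_image, sample.edited_image, prompt)
        if len(steps) != num_steps:
            continue  # keep only well-formed blueprints
        corpus.append(BlueprintSample(sample, prompt, steps))
    return corpus


# Stub stages so the sketch runs offline.
raw = [
    EditSample("a.png", "a_edit.png", "add cat"),
    EditSample("b.png", "b_edit.png", "make it blue"),
]
scores = {"a.png": 0.9, "b.png": 0.2}
corpus = build_blueprint_corpus(
    raw,
    assess_quality=lambda s: scores[s.source_image],
    optimize_prompt=lambda p: p.strip().capitalize() + ".",
    imagine=lambda src, dst, p: [f"step {i}: <placeholder>" for i in range(1, 5)],
)
print(len(corpus), corpus[0].optimized_prompt)
```

Passing the stages in as callables keeps the control flow testable without network access; a real pipeline would swap the stubs for model calls.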