MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
TL;DR
The paper addresses the challenge of unifying subject-driven image generation and instruction-based editing by introducing MIGE, a framework that leverages multimodal instructions and a unified input-output formulation. It deploys a novel multimodal encoder that fuses semantic and visual features via a dedicated fusion mechanism and couples it with a diffusion-based generator, enabling joint training across tasks. This cross-task training yields mutual improvements in instruction adherence and subject preservation, and enables generalization to new compositional tasks such as instruction-based subject-driven editing. To support this, the authors develop a data construction pipeline and MIGEBench, reporting state-of-the-art performance on MIGEBench and strong results on existing benchmarks, with code and models publicly released.
Abstract
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. It then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art (SOTA) in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
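The shared input-output formulation can be illustrated with a minimal sketch. It is not the released implementation. The class `UnifiedSample`, the helper names, the `<image>` placeholder token, the zero fill value for the blank canvas, and the toy list-of-lists image type are all assumptions made for illustration. The only point it shows is that both tasks reduce to the same record shape, differing in what the conditioning canvas holds.

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in for an image: rows of grayscale pixel values (assumption).
Image = List[List[int]]


@dataclass
class UnifiedSample:
    """Hypothetical record: one shape for both tasks."""
    instruction: str                # text with image placeholders (hypothetical token)
    reference_images: List[Image]   # images referred to by the instruction
    canvas: Image                   # conditioning image the generator starts from
    target: Image                   # desired output image


def blank_canvas(height: int, width: int, fill: int = 0) -> Image:
    # Fill value 0 is an assumption; the abstract only says "blank canvas".
    return [[fill] * width for _ in range(height)]


def make_generation_sample(instruction: str, subjects: List[Image],
                           target: Image) -> UnifiedSample:
    # Subject-driven generation: creation on a blank canvas sized like the target.
    height, width = len(target), len(target[0])
    return UnifiedSample(instruction, subjects, blank_canvas(height, width), target)


def make_editing_sample(instruction: str, source: Image,
                        target: Image) -> UnifiedSample:
    # Instruction-based editing: modification of an existing image,
    # so the canvas is a copy of the source image.
    canvas = [row[:] for row in source]
    return UnifiedSample(instruction, [source], canvas, target)


# Usage: both tasks yield the same record type, so one training loop can consume both.
gen = make_generation_sample("a photo of <image> on a beach",
                             [[[5, 5], [5, 5]]], [[1, 2], [3, 4]])
edit = make_editing_sample("make <image> brighter",
                           [[1, 2], [3, 4]], [[2, 3], [4, 5]])
batch = [gen, edit]
```

Because both constructors return the same type, a mixed batch needs no task-specific branching downstream, which is the property the abstract credits for enabling joint training.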
