InstructX: Towards Unified Visual Editing with MLLM Guidance

Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He

TL;DR

InstructX is a unified framework that uses a Multimodal Large Language Model (MLLM) to guide diffusion-based editing of both images and videos. Learnable queries appended to the MLLM, a lightweight MLP connector, and modality-aware training (including LoRA for the MLLM) give fast convergence and strong editing performance across tasks. The method relies on a three-stage training strategy and mixed image-video data, and the paper introduces VIE-Bench, a new video-editing benchmark. Experiments report state-of-the-art results on image benchmarks and video-editing performance competitive with closed-source methods. The practical takeaway for MLLM–diffusion integration is that editing should be realized largely within the MLLM, with only a lean bridge to the generation module, which supports a broad range of instruction-driven edits.

Abstract

With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.

Paper Structure

This paper contains 21 sections, 22 figures, and 4 tables.

Figures (22)

  • Figure 1: Overview of InstructX. The MLLM serves as the understanding module, generating editing guidance given the input instruction and visual inputs. The DiT serves as the generation module and connects to the MLLM via learnable queries and an MLP connector.
  • Figure 2: Different design choices for unified editing modeling.
  • Figure 3: Illustration of alignment ability (left) and editing performance (right) for different design choices.
  • Figure 4: Illustration of the three training stages of our method.
  • Figure 5: Examples of emergent video editing capabilities learned from image data.
  • ...and 17 more figures
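
The data flow described in the Figure 1 caption (MLLM hidden states read out at learnable query positions, then projected by an MLP connector into the DiT's conditioning space) can be sketched roughly as follows. This is a minimal illustration in pure Python, not the authors' implementation: `stub_mllm` is a hypothetical stand-in for a real MLLM, the two-layer GELU connector is an assumed shape for the "lightweight MLP", appending the queries to the end of the input sequence is an assumed reading of "appended to the MLLM", and all dimensions and names are invented for the example. LoRA and the DiT itself are omitted.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters or token embeddings."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(x, w):
    """Multiply x of shape [n, d_in] by w of shape [d_in, d_out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def gelu(v):
    """Tanh approximation of GELU (an assumed activation choice)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


class MLPConnector:
    """Hypothetical two-layer MLP mapping MLLM features to DiT conditioning width."""

    def __init__(self, d_mllm, d_hidden, d_dit, rng):
        self.w1 = make_matrix(d_mllm, d_hidden, rng)
        self.w2 = make_matrix(d_hidden, d_dit, rng)

    def __call__(self, x):
        hidden = [[gelu(v) for v in row] for row in matmul(x, self.w1)]
        return matmul(hidden, self.w2)


def stub_mllm(tokens):
    """Placeholder for the MLLM: a causal running mean over the sequence,
    so each position depends on itself and all earlier tokens."""
    running = [0.0] * len(tokens[0])
    outputs = []
    for count, token in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, token)]
        outputs.append([r / count for r in running])
    return outputs


def editing_guidance(context_tokens, queries, connector):
    """Append the queries after the instruction/visual tokens, run the MLLM,
    keep only the hidden states at the query positions, and project them."""
    sequence = context_tokens + queries
    hidden = stub_mllm(sequence)
    query_hidden = hidden[-len(queries):]
    return connector(query_hidden)


# Illustrative sizes only; none of these values come from the paper.
D_MLLM, D_HIDDEN, D_DIT = 8, 16, 4
N_CONTEXT, N_QUERIES = 5, 3

rng = random.Random(0)
context = make_matrix(N_CONTEXT, D_MLLM, rng)   # stands in for instruction + visual tokens
queries = make_matrix(N_QUERIES, D_MLLM, rng)   # stands in for the learnable queries
connector = MLPConnector(D_MLLM, D_HIDDEN, D_DIT, rng)

guidance = editing_guidance(context, queries, connector)
print(len(guidance), len(guidance[0]))
```

In this sketch the DiT would receive one conditioning vector per query, so the number of queries fixes the size of the bridge between understanding and generation regardless of how long the instruction or visual input is.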