DreamOmni2: Multimodal Instruction-based Editing and Generation

Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia

TL;DR

DreamOmni2 addresses the limitations of instruction-only editing and subject-driven generation by introducing two tasks, multimodal instruction-based editing and multimodal instruction-based generation, which accept both text and image instructions and cover concrete objects as well as abstract attributes. It presents a three-stage data synthesis pipeline: feature mixing creates extraction data, an extraction model trained on that data generates reference images, and the resulting images form training data for editing and generation. On the model side, the DreamOmni2 framework adds index encoding and a position encoding shift to handle multiple input images, and is trained jointly with a vision-language model. A new benchmark built from real images evaluates editing and generation with multiple reference images, including abstract attributes. Experiments show DreamOmni2 achieving state-of-the-art performance against open-source baselines and approaching commercial models, with ablations supporting the encoding schemes and joint training. The work provides data and benchmarks for further research on flexible multimodal content creation.

Abstract

Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and code will be released.
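
The index encoding and position encoding shift mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact offset rule, so the code below assumes one plausible choice: each image's token grid is shifted diagonally by the combined height and width of all preceding images, and every token also carries the index of the image it belongs to. The function name `build_position_ids` and the tuple layout are hypothetical, not the paper's implementation.

```python
from typing import List, Tuple

# (image index, row, column) for a single image token
TokenId = Tuple[int, int, int]


def build_position_ids(image_sizes: List[Tuple[int, int]]) -> List[List[TokenId]]:
    """Assign position ids to the tokens of several input images.

    image_sizes holds (height, width) of each image measured in tokens.
    ASSUMPTION: the shift is a diagonal offset equal to the accumulated
    height and width of the preceding images; the paper may use another rule.
    """
    all_ids: List[List[TokenId]] = []
    row_offset = 0
    col_offset = 0
    for index, (height, width) in enumerate(image_sizes):
        tokens = [
            (index, row_offset + r, col_offset + c)
            for r in range(height)
            for c in range(width)
        ]
        all_ids.append(tokens)
        # Shift the next image past this one so coordinates never coincide.
        row_offset += height
        col_offset += width
    return all_ids


if __name__ == "__main__":
    ids = build_position_ids([(2, 2), (2, 3)])
    print(ids[0][0])  # first token of the first image
    print(ids[1][0])  # first token of the second image, shifted by (2, 2)
```

Under this assumption no two images share a (row, column) coordinate, which is one way to avoid the pixel confusion the abstract describes, while the leading index lets the model tell which image a token came from.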

Paper Structure

This paper contains 12 sections, 1 equation, 30 figures, 5 tables.

Figures (30)

  • Figure 1: Overview gallery: DreamOmni2 enables multimodal instruction-based editing and generation, extending beyond concrete objects to abstract attributes.
  • Figure 2: Overview of DreamOmni2's training data construction. (1) In stage 1, we use a feature mixing scheme to leverage the base model's T2I capabilities, creating high-quality data pairs with concrete objects and abstract attributes. (2) In stage 2, we generate multimodal instruction-based editing data. Using stage 1 data, we train an extraction model that reproduces objects or attributes from the target image and generates a reference image based on instructions. Additionally, we use an instruction-based editing model to alter those same objects or attributes in the target image, creating the source image. This yields training pairs that map the reference and source images to the target image. (3) In stage 3, we extract objects from stage 2’s source images to create new reference images, forming training data for generating target images from reference images.
  • Figure 3: Data distribution and samples for multimodal instruction-based editing and generation training data. Our dataset is comprehensive and diverse, covering the generation and editing of concrete objects as well as abstract attributes, both local and global.
  • Figure 4: Visual comparison of multimodal instruction-based editing. Compared to other competitive methods and even closed-source commercial models (GPT-4o and Nano Banana), DreamOmni2 shows more accurate editing results and better consistency.
  • Figure 5: Visual comparison of multimodal instruction-based generation. DreamOmni2 significantly outperforms current open-source models and achieves generation results comparable to closed-source commercial models (GPT-4o and Nano Banana).
  • ...and 25 more figures
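
The data flow of stages 2 and 3 in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: images are stood in for by strings, `stub_extract` and `stub_edit` are hypothetical placeholders for the extraction model and the instruction-based editing model, the instruction wording is invented, and the choice to pair the stage 2 reference with the newly extracted one in stage 3 is an assumption rather than something the caption states.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = str  # placeholder type; a real pipeline would hold image tensors


@dataclass
class EditingSample:
    reference: Image
    source: Image
    instruction: str
    target: Image


@dataclass
class GenerationSample:
    references: List[Image]
    instruction: str
    target: Image


def make_editing_sample(
    target: Image,
    concept: str,
    extract: Callable[[Image, str], Image],
    edit: Callable[[Image, str], Image],
) -> EditingSample:
    """Stage 2: build (reference, source) -> target from one target image."""
    reference = extract(target, concept)  # reference image showing the concept
    source = edit(target, concept)        # target with that concept changed
    instruction = f"Apply the {concept} of the reference image to the source image."
    return EditingSample(reference, source, instruction, target)


def make_generation_sample(
    sample: EditingSample,
    object_name: str,
    extract: Callable[[Image, str], Image],
) -> GenerationSample:
    """Stage 3: extract an object from the source to get a new reference."""
    new_reference = extract(sample.source, object_name)
    instruction = f"Generate an image combining the {object_name} with the other reference."
    # ASSUMPTION: the stage 2 reference is reused alongside the new one.
    return GenerationSample([sample.reference, new_reference], instruction, sample.target)


def stub_extract(image: Image, concept: str) -> Image:
    """Hypothetical stand-in for the extraction model."""
    return f"extract({image}, {concept})"


def stub_edit(image: Image, concept: str) -> Image:
    """Hypothetical stand-in for the instruction-based editing model."""
    return f"edit({image}, {concept})"


if __name__ == "__main__":
    editing = make_editing_sample("target.png", "material", stub_extract, stub_edit)
    generation = make_generation_sample(editing, "chair", stub_extract)
    print(editing)
    print(generation)
```

The point of the sketch is the direction of the construction: the target image is the starting point, and both the reference and the source are derived from it, so each training sample has a real target to supervise against.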