
DreamOmni: Unified Image Generation and Editing

Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia

TL;DR

DreamOmni tackles the lack of a unified framework for image generation and editing by analyzing existing diffusion-model architectures and introducing a vision-language-conditioned, DiT-based latent diffusion backbone. It pairs a synthetic collage data pipeline with multi-task training to scale up high-quality editing data while preserving T2I generation quality, so that generation and editing can be learned jointly and efficiently. Empirical results show improved generation fidelity, editing accuracy, and robustness across instruction-based editing, drag editing, inpainting/outpainting, and reference-image generation. Ablations show fast convergence and a benefit from concentrating computation on higher-resolution latents. The approach offers a practical path to deploying and scaling unified image generation and editing models, and the authors plan to release code and models.

Abstract

The success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. In computer vision, however, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, and propose a unified framework that integrates both T2I models and various editing tasks. Another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline that uses sticker-like elements to synthesize accurate, high-quality datasets efficiently, which makes it possible to scale up editing data for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.

Paper Structure

This paper contains 8 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The gallery of DreamOmni. DreamOmni, as a native unified image generation and editing model, can handle various tasks.
  • Figure 2: The overview of DreamOmni. (a) The DreamOmni framework supports unified image generation and editing, with fast training convergence and powerful performance. (b) To overcome the difficulty and inefficiency of data creation and filtering for image editing, we propose a collage-based synthetic data pipeline. This pipeline enables the efficient creation of data for various editing tasks, such as addition, deletion, and replacement operations in instruction-based editing, as well as translation, scaling, and rotation in drag editing. Additionally, it supports reference image generation and segmentation & detection. Furthermore, our synthetic data pipeline enhances the accuracy of T2I generation. Due to space limitations, we show the corresponding prompts or instructions for only some of these cases.
  • Figure 3: Comparison of different frameworks. The left figure shows the FID comparison among different frameworks, while the right table shows their number of parameters and runtime.
  • Figure 4: Visual comparison on T2I generation. Compared to other competitive methods (including SD3-Medium, SDXL, SD-Cascade, and SD1.5), our DreamOmni not only adheres better to user prompts but also generates more visually appealing results, with delicate details and elegant composition.
  • Figure 5: Visual comparison on inpainting & outpainting among DreamOmni, ControlNet-Inpainting, and SD-inpainting.
  • ...and 4 more figures
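
The collage-based pipeline described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes images are plain 2D lists of pixel values, that a sticker marks transparent pixels with `None`, and the helper names (`paste_sticker`, `make_edit_pair`, `make_drag_pair`) are hypothetical. It only shows why pasting a sticker gives exact before/after pairs for add, remove, and translation-style drag edits without manual filtering.

```python
from copy import deepcopy


def paste_sticker(canvas, sticker, top, left):
    """Return a copy of `canvas` with `sticker` pasted at (top, left).

    Assumption: images are 2D lists; `None` in the sticker is transparent.
    Pixels that fall outside the canvas are silently clipped.
    """
    out = deepcopy(canvas)
    for r, row in enumerate(sticker):
        for c, value in enumerate(row):
            y, x = top + r, left + c
            if value is not None and 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = value
    return out


def make_edit_pair(background, sticker, name, top, left, task="add"):
    """Build a (source, target, instruction) sample for an add or remove edit."""
    composed = paste_sticker(background, sticker, top, left)
    clean = deepcopy(background)
    if task == "add":
        # Source lacks the object, target contains it.
        return {"source": clean, "target": composed, "instruction": f"add a {name}"}
    if task == "remove":
        # The same pair, reversed.
        return {"source": composed, "target": clean, "instruction": f"remove the {name}"}
    raise ValueError(f"unsupported task: {task}")


def make_drag_pair(background, sticker, start, end):
    """Build a translation-style drag sample: the same sticker at two positions."""
    return {
        "source": paste_sticker(background, sticker, *start),
        "target": paste_sticker(background, sticker, *end),
        "drag": {"from": start, "to": end},
    }


if __name__ == "__main__":
    background = [[0] * 4 for _ in range(3)]
    star = [[7, None], [7, 7]]  # hypothetical 2x2 sticker with one transparent pixel
    sample = make_edit_pair(background, star, "star", top=1, left=1, task="add")
    print(sample["instruction"])
    for row in sample["target"]:
        print(row)
```

Because the sticker's position is known at composition time, the ground-truth target is exact by construction. That is the property the caption credits for efficient data creation. Scaling and rotation would need an image library and are left out of this sketch.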