DreamO: A Unified Framework for Image Customization

Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu

TL;DR

This work addresses the lack of a unified framework for image customization across multiple condition types. It introduces DreamO, a DiT-based framework that encodes multiple condition inputs into a single conditioning sequence, augmented with routing constraints and a placeholder mechanism, trained progressively on a large, multi-task dataset. Key contributions include a cross-condition routing loss to disentangle features, a placeholder-to-condition alignment, and a three-stage training protocol that improves convergence while preserving the model priors. Experiments across identity, subject, try-on, and style tasks demonstrate high fidelity, robust text alignment, and effective multi-condition integration with a single model.

Abstract

Recently, extensive research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which limits their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

Paper Structure

This paper contains 18 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overview of our proposed DreamO, which can uniformly handle commonly used consistency-aware generation control.
  • Figure 2: Visualization of cross-attention maps in subject-driven image generation. The first row shows results from a model trained without routing constraints, while the second row presents results from a model trained with routing constraints.
  • Figure 3: The progressive training pipeline of our method. Left column shows the three training stages of our method. Right column shows the generation capability after the training of each stage.
  • Figure 4: Visual comparison between our DreamO and other methods.
  • Figure 5: The ablation study of the placeholder-to-image routing constraint.
  • ...and 7 more figures