USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He

TL;DR

USO tackles the long-standing separation between style-driven and subject-driven image generation by introducing a cross-task co-disentanglement framework. It couples a cross-task triplet data curation pipeline with a two-stage USO training process—Style Alignment Training and Content-Style Disentanglement Training—augmented by a Style Reward Learning objective. The authors release USO-Bench to jointly evaluate subject fidelity and style similarity across tasks and demonstrate state-of-the-art performance on subject-driven, style-driven, and joint style-subject-driven generation, validated by quantitative metrics and user studies. This work highlights the benefits of mutual cross-task supervision for precise feature disentanglement and flexible composition of subjects and styles in diffusion-based generation, with code and models publicly available.

Abstract

Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
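
The triplet described in the abstract (content image, style image, stylized content image) can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not the authors' data format: the names `Triplet`, `TripletKind`, and `select_references`, the field names, and the file paths are all hypothetical, and the mapping from task to reference images is an illustrative guess rather than something the paper summary specifies. The layout-preserved versus layout-shifted distinction is taken from the Figure 3 caption.

```python
from dataclasses import dataclass
from enum import Enum


class TripletKind(Enum):
    """Two triplet variants named in the Figure 3 caption."""
    LAYOUT_PRESERVED = "layout_preserved"
    LAYOUT_SHIFTED = "layout_shifted"


@dataclass(frozen=True)
class Triplet:
    """Hypothetical record for one (content, style, stylized) triplet."""
    content_path: str   # content (subject) reference image
    style_path: str     # style reference image
    stylized_path: str  # stylized content image
    kind: TripletKind


def select_references(triplet: Triplet, task: str) -> list[str]:
    """Return the reference images one might feed as conditions per task.

    Assumption for illustration only: subject-driven generation uses the
    content reference, style-driven uses the style reference, and the joint
    task uses both.
    """
    if task == "subject":
        return [triplet.content_path]
    if task == "style":
        return [triplet.style_path]
    if task == "joint":
        return [triplet.content_path, triplet.style_path]
    raise ValueError(f"unknown task: {task!r}")


# Hypothetical paths, used only to show the shape of a record.
example = Triplet(
    content_path="content/0001.png",
    style_path="style/0001.png",
    stylized_path="stylized/0001.png",
    kind=TripletKind.LAYOUT_SHIFTED,
)
```

Keeping the three images in one record makes it easy to draw all three task variants from the same sample, which is one way to read the "mutual cross-task supervision" the TL;DR mentions.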

Paper Structure

This paper contains 26 sections, 7 equations, 22 figures, 4 tables, 1 algorithm.

Figures (22)

  • Figure 1: Showcase of the versatile abilities of the USO model. Prompts are listed in the teaser table (label: tab:teaser).
  • Figure 2: Illustration of our motivation. By jointly disentangling content and style across tasks, we unify style-driven and subject-driven generation within a single framework.
  • Figure 3: Illustration of our proposed cross-task triplet curation framework, which systematically generates layout-preserved and layout-shifted triplets.
  • Figure 4: Illustration of the training framework of USO. USO unifies subject-driven and style-driven generation in two stages: Stage 1 aligns SigLIP embeddings via style-alignment training to yield a style-capable model; Stage 2 disentangles the conditional encoders and trains on triplets to enable joint conditional generation. Finally, a style-reward learning paradigm supervises both stages to yield a stronger unified model.
  • Figure 5: Qualitative comparison with different methods on subject-driven generation.
  • ...and 17 more figures