Transition Models: Rethinking the Generative Learning Objective

Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, Lei Bai

TL;DR

Transition Models (TiM) address the enduring trade-off in generative modeling between many refinement steps and high output fidelity by learning arbitrary-state transitions across any time interval using a new state-transition identity. TiM replaces traditional PF-ODE supervision with a global trajectory-consistent objective, facilitated by the Differential Derivation Equation for efficient, scalable training and augmented by architectural innovations such as decoupled time embeddings and interval-aware attention. Empirically, a compact 865M-parameter TiM achieves state-of-the-art results across NFEs and image resolutions, including 4096x4096, and exhibits monotonic quality gains as the sampling budget increases, outperforming larger diffusion models on GenEval and MJHQ benchmarks. The work demonstrates a practical, scalable path toward versatile, high-fidelity image generation from scratch, unifying few-step efficiency with many-step refinement.

Abstract

A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval. This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096x4096.
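
To make the idea of arbitrary-interval transitions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a hypothetical callable `transition(x, t, r)` that maps a state at time `t` to the state at time `r`, and a sampler that splits the interval from t = 1 to t = 0 into any number of steps. The names `time_grid`, `sample`, and `toy_transition` are illustrative only, and the toy map is the exact flow of the scalar ODE dx/dt = x, standing in for a trained network.

```python
import math
from typing import Callable, List

# Hypothetical signature: (state at time t, t, r) -> state at time r.
Transition = Callable[[float, float, float], float]


def time_grid(num_steps: int, t_start: float = 1.0, t_end: float = 0.0) -> List[float]:
    """Evenly spaced times from t_start to t_end, with num_steps intervals."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    step = (t_end - t_start) / num_steps
    return [t_start + i * step for i in range(num_steps)] + [t_end]


def sample(transition: Transition, x_start: float, num_steps: int) -> float:
    """Apply the transition map over consecutive intervals of the time grid."""
    ts = time_grid(num_steps)
    x = x_start
    for t, r in zip(ts[:-1], ts[1:]):
        x = transition(x, t, r)
    return x


def toy_transition(x: float, t: float, r: float) -> float:
    """Exact flow map of dx/dt = x (an assumption for illustration):
    x(r) = x(t) * exp(r - t)."""
    return x * math.exp(r - t)


if __name__ == "__main__":
    one_leap = sample(toy_transition, 2.0, num_steps=1)
    refined = sample(toy_transition, 2.0, num_steps=8)
    # An exact transition map yields the same endpoint for any step count.
    print(math.isclose(one_leap, refined, rel_tol=1e-9))
```

Because the toy map is exact, one leap and eight smaller steps reach the same endpoint. A learned model only approximates such a map, which is why the step count can still affect sample quality in practice.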

Paper Structure

This paper contains 41 sections, 45 equations, 6 figures, 14 tables, 2 algorithms.

Figures (6)

  • Figure 1: TiM's superior performance across different NFEs, resolutions, and aspect ratios. On the GenEval benchmark [ghosh2023geneval], TiM outperforms FLUX.1 models [flux.1-dev2024, flux.1-sch2024] at different NFEs (top, $1024\times 1024$), at higher resolutions (middle, $1024\times 1024$ to $4096\times 4096$), and at diverse aspect ratios (bottom, $2:5$ to $5:2$).
  • Figure 2: Illustration of Different Generative Paradigms. While conventional diffusion models learn the local vector field and few-step models learn a fixed endpoint map (a single large step), our Transition Models (TiM) are trained to master arbitrary state-to-state transitions. This approach allows TiM to learn the entire solution manifold of the generative process, unifying the few-step and many-step regimes within a single, powerful model.
  • Figure 3: Qualitative Analysis between TiM and existing methods under different NFEs. TiM delivers superior fidelity and text alignment across all NFEs. In contrast, multi-step diffusion and few-step distilled models exhibit pronounced step–quality trade-offs: SDXL, SD3.5-Large, and FLUX.1-Dev fail to generate images at low NFEs, while SDXL-Turbo, SD3.5-Turbo, and FLUX.1-Schnell produce over-saturated outputs at high NFEs.
  • Figure 4: TiM Model Architecture.
  • Figure 5: TiM T2I block.
  • ...and 1 more figure