
Duality Models: An Embarrassingly Simple One-step Generation Paradigm

Peng Sun, Xinyi Shang, Tao Lin, Zhiqiang Shen

TL;DR

We propose Duality Models (DuMo), which use a shared backbone with dual heads to apply the geometric constraints of the multi-step objective to every sample. This bounds the few-step estimation without splitting the training budget between separate objectives, significantly improving stability and efficiency.

Abstract

Consistency-based generative models like Shortcut and MeanFlow achieve impressive results via a target-aware design for solving the Probability Flow ODE (PF-ODE). Typically, such methods introduce a target time $r$ alongside the current time $t$ to modulate outputs between a local multi-step derivative ($r = t$) and a global few-step integral ($r = 0$). However, the conventional "one input, one output" paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi-step objective for stability. This separation forces a trade-off: allocating sufficient samples to the multi-step objective leaves the few-step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a "one input, dual output" paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity $v_t$ and flow-map $u_t$ from a single input $x_t$. This applies geometric constraints from the multi-step objective to every sample, bounding the few-step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 $\times$ 256, a 679M Diffusion Transformer with SD-VAE achieves a state-of-the-art (SOTA) FID of 1.79 in just 2 steps. Code is available at: https://github.com/LINs-lab/DuMo
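
The distinction between the local derivative ($r = t$) and the global integral ($r = 0$) can be illustrated on a toy problem. The sketch below is not the paper's method: it assumes a hypothetical scalar ODE $dx/dt = A x$ with a closed-form solution, and it takes the flow-map output to be a MeanFlow-style average velocity over $[r, t]$, which is an assumption about the parameterization. The names `A`, `velocity`, `solve`, and `average_velocity` are illustrative only.

```python
import math

A = -1.5  # hypothetical rate of the toy ODE dx/dt = A * x (an assumption)


def velocity(x_t):
    """Instantaneous velocity v_t: the local derivative of the toy ODE."""
    return A * x_t


def solve(x_t, t, r):
    """Exact state at time r on the trajectory that passes through x_t at time t."""
    return x_t * math.exp(A * (r - t))


def average_velocity(x_t, t, r):
    """Average velocity over [r, t]; equals the instantaneous velocity when r == t."""
    if r == t:
        return velocity(x_t)
    return (x_t - solve(x_t, t, r)) / (t - r)


x_t, t = 2.0, 1.0

# As r approaches t, the average velocity approaches the local derivative.
near_local = average_velocity(x_t, t, t - 1e-6)

# With r = 0, a single jump along the average velocity lands on the endpoint.
u_global = average_velocity(x_t, t, 0.0)
x_0_one_step = x_t - (t - 0.0) * u_global
```

In this toy setting the one-step jump is exact because the average velocity is computed from the true solution; a learned model would only approximate it, which is why the abstract stresses constraining the few-step estimate.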

Paper Structure

This paper contains 57 sections, 1 theorem, 10 equations, 10 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

The training of consistency-based models can be analyzed via a surrogate loss function $\mathcal{L}(\boldsymbol{\theta}, \lambda)$, where $\lambda \in (0, 1)$ denotes the Consistency Ratio. This objective asymptotically recovers flow matching models as $\lambda \to 0$ and approximates consistency models [...] where $\mathbf{x}_t = t \cdot \mathbf{z} + (1 - t) \cdot \mathbf{x}$ and $\mathbf{z}_t = \mathbf{z} \cdots$
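
The linear interpolant $\mathbf{x}_t = t \cdot \mathbf{z} + (1 - t) \cdot \mathbf{x}$ stated in the theorem can be made concrete with a short sketch. It assumes $\mathbf{x}$ is a data sample and $\mathbf{z}$ a noise sample (so $t = 0$ gives data and $t = 1$ gives noise), and it checks numerically that differentiating the interpolant with respect to $t$ gives the constant $\mathbf{z} - \mathbf{x}$. The function names and sample values are hypothetical.

```python
def interpolate(x, z, t):
    """Return x_t = t * z + (1 - t) * x, element-wise."""
    return [t * zi + (1.0 - t) * xi for xi, zi in zip(x, z)]


def time_derivative(x, z):
    """d x_t / dt for the linear interpolant; it does not depend on t."""
    return [zi - xi for xi, zi in zip(x, z)]


x = [1.0, -2.0]   # hypothetical data sample
z = [0.5, 0.25]   # hypothetical noise sample

x_at_0 = interpolate(x, z, 0.0)  # endpoint at t = 0 is the data sample
x_at_1 = interpolate(x, z, 1.0)  # endpoint at t = 1 is the noise sample

# Central finite difference at an arbitrary time, to compare with z - x.
eps, t = 1e-6, 0.3
finite_diff = [
    (hi - lo) / (2 * eps)
    for lo, hi in zip(interpolate(x, z, t - eps), interpolate(x, z, t + eps))
]
```

Because the derivative is constant along each straight path, the per-sample regression target for the velocity is the same at every $t$.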

Figures (10)

  • Figure 1: Schematic comparison. (Top) Existing one-step models geng2025mean, cheng2025twinflow adopt a "one input, one output" scheme, predicting either velocity $\mathbf{v}_{t}$ or flow-map $\mathbf{u}_{t}$ conditioned on $r$. (Bottom) DuMo employs a "one input, dual output" design to simultaneously predict $\mathbf{v}_{t}$ and $\mathbf{u}_{t}$ from $\mathbf{x}_{t}$. This incurs negligible overhead ($<0.5\%$, e.g., +3M on a 675M DiT) by adding only an output head while preserving the backbone.
  • Figure 2: Conceptual illustration of learning paradigms. Trajectories illustrate the mapping from a current prediction state to a target learning state. (a) Multi-step methods, such as standard diffusion ho2020denoising and flow-matching lipman2022flow, learn the local derivative (velocity). (b) Prominent one-step approaches, including consistency models song2023consistency, MeanFlow geng2025mean, and Shortcut models frans2024one, learn the global integral (flow-map). (c) DuMo unifies these paradigms via a dual-output architecture, combining the geometric constraints from (a) with the generative efficiency of (b).
  • Figure 3: Impact of Velocity Ratio ($\rho$) on one-step generation using MeanFlow geng2025mean. We visualize samples generated on the Moons dataset to analyze the behavior of this representative single-branch model. Gray areas: Ground truth distribution; Red dots: Samples generated via one-step inference. We report Maximum Mean Discrepancy (MMD) to quantify quality (lower is better). The results highlight a rigid trade-off: low velocity supervision ($\rho=0$) leads to instability/divergence, while excessive supervision ($\rho=1.0$) hampers the learning of the few-step mapping. Optimal performance is achieved at $\rho=0.8$, confirming that MeanFlow requires a specific partition of the training budget to balance stability and efficacy.
  • Figure 4: Ablation studies of DuMo on ImageNet-1K $256 \times 256$. We investigate key design factors of DuMo for one-step generation, benchmarking against the single-branch baseline, MeanFlow. Unless noted otherwise, experiments in (a) and (c) employ a DiT-B/2 backbone trained for 10K iterations with a learning rate of $2 \times 10^{-4}$, adhering to the optimization protocol in the experimental setup section.
  • Figure 5: Visualization of sampling trajectories at the baseline state (0 training steps). Initially, the model requires significant computational steps to resolve coherent images. The leftmost columns (low NFE) remain unstructured and blurry, with high-fidelity results only emerging in the rightmost columns, indicating a highly curved generation trajectory typical of the teacher model.
  • ...and 5 more figures
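
The "one input, dual output" design in Figure 1 can be sketched as a shared feature extractor followed by two small output heads. This is a minimal toy stand-in, not the authors' DiT implementation: the class `DualHeadModel`, the single ReLU layer used as the backbone, and the choice to condition on $t$ by appending it to the input are all assumptions made for illustration. The last line only reproduces the overhead arithmetic quoted in the caption.

```python
import random


def init_linear(rng, n_in, n_out):
    """Create a small random linear layer as (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class DualHeadModel:
    """Hypothetical toy model: one shared backbone, two output heads."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.backbone = init_linear(rng, dim + 1, hidden)  # input is x_t plus t (assumed)
        self.v_head = init_linear(rng, hidden, dim)        # velocity head
        self.u_head = init_linear(rng, hidden, dim)        # flow-map head

    def forward(self, x_t, t):
        # Shared features are computed once and reused by both heads.
        features = [max(0.0, a) for a in linear(*self.backbone, x_t + [t])]
        return linear(*self.v_head, features), linear(*self.u_head, features)


model = DualHeadModel(dim=4, hidden=64)
v_t, u_t = model.forward([0.1, -0.2, 0.3, 0.0], t=0.7)

# Overhead quoted in the Figure 1 caption: +3M parameters on a 675M backbone.
caption_overhead = 3e6 / 675e6
```

Since both outputs come from the same forward pass, every training sample can contribute to the velocity loss and the flow-map loss at once. The toy layer sizes here say nothing about the real overhead, which the caption reports as below 0.5%.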

Theorems & Definitions (1)

  • Theorem 1: Surrogate objective for unified linear case ($\alpha(t) = t, \ \gamma(t) = 1-t$), see sun2025unified