
Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang

Abstract

State-of-the-art text-to-image models produce high-quality images, but inference remains expensive because generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step. Fair comparisons to multi-step systems are difficult, however, because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, and CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment, human preference signals, and perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
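The abstract names MMHM but does not give its formula here. The sketch below is one plausible reading of the name, not the paper's definition: each metric is min-max normalized across the configurations of a sweep (with FID inverted, since lower is better), and the four normalized values are combined with a harmonic mean. The function names, the `eps` smoothing term, and the example numbers are all assumptions made for illustration.

```python
from typing import Dict, List


def min_max(values: List[float], higher_is_better: bool = True) -> List[float]:
    """Scale values to [0, 1] across a sweep; flip when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All configurations tie on this metric, so none is penalized.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def harmonic_mean(xs: List[float], eps: float = 1e-8) -> float:
    """Harmonic mean; eps (an assumption) avoids division by zero."""
    return len(xs) / sum(1.0 / (x + eps) for x in xs)


def mmhm_sketch(sweep: List[Dict[str, float]]) -> List[float]:
    """Assumed composite: harmonic mean of min-max normalized metrics."""
    # True means higher is better; FID is the only lower-is-better metric.
    directions = {"fid": False, "is": True, "clip": True, "pick": True}
    normalized = {
        name: min_max([row[name] for row in sweep], higher)
        for name, higher in directions.items()
    }
    return [
        harmonic_mean([normalized[name][i] for name in directions])
        for i in range(len(sweep))
    ]


# Made-up numbers for three hypothetical (steps, CFG) configurations.
sweep = [
    {"fid": 10.0, "is": 100.0, "clip": 0.30, "pick": 0.20},
    {"fid": 5.0, "is": 200.0, "clip": 0.32, "pick": 0.22},
    {"fid": 3.0, "is": 150.0, "clip": 0.28, "pick": 0.21},  # best FID only
]
scores = mmhm_sketch(sweep)
best_index = max(range(len(scores)), key=scores.__getitem__)
```

Under this reading, the third configuration has the best FID but the worst CLIP Score, so its composite collapses toward zero and the second configuration is selected. A harmonic mean penalizes any single weak metric, which is consistent with the abstract's point that FID alone can mislead.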

Paper Structure

This paper contains 48 sections, 3 theorems, 1 equation, 9 figures, 17 tables.

Key Result

lemma 1

A caption containing an ambiguous lemma $\hat{\ell}$ with $|\{k : \hat{\ell} \in \hat{\mathcal{L}}_k\}| > 1$ cannot be unambiguously assigned to a single class and is discarded.
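The rule in lemma 1 can be sketched as a small filter. This is a minimal illustration that assumes captions have already been tokenized and lemmatized upstream and that each class $k$ comes with a lemma set $\hat{\mathcal{L}}_k$. The function names and the example class IDs are hypothetical and are not taken from the paper's pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def ambiguous_lemmas(class_lemmas: Dict[int, Set[str]]) -> Set[str]:
    """Return lemmas that appear in the lemma set of more than one class."""
    counts = Counter(
        lemma for lemmas in class_lemmas.values() for lemma in set(lemmas)
    )
    return {lemma for lemma, n in counts.items() if n > 1}


def keep_caption(
    caption_lemmas: Iterable[str], class_lemmas: Dict[int, Set[str]]
) -> bool:
    """Discard (return False for) captions containing any ambiguous lemma."""
    ambiguous = ambiguous_lemmas(class_lemmas)
    return not any(lemma in ambiguous for lemma in caption_lemmas)


# Illustrative lemma sets: "crane" names both a bird and a machine class.
class_lemmas = {
    1: {"crane"},
    2: {"crane"},
    3: {"golden retriever"},
}

kept = keep_caption(["a", "golden retriever", "on", "grass"], class_lemmas)
dropped = keep_caption(["a", "crane", "by", "the", "river"], class_lemmas)
```

Here `kept` is `True` and `dropped` is `False`: the second caption contains "crane", which belongs to two class lemma sets and so cannot be assigned to a single label ID.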

Figures (9)

  • Figure 1: Left: Example images from ImageNet, ImageNetV2, and reLAIONet for the same label IDs, illustrating the distribution shift of reLAIONet. Middle/Right: Qualitative samples from the Scuba Diver and Sports Car classes, where samples from models with lower FID still exhibit visible distortions (e.g., in faces and local structure). The reported FID is a model-level score computed over the full evaluation set, so all samples from a given model share the same value; additional examples are in the supplement.
  • Figure 2: Within-family comparison for native one-step models. Images generated for the classes Golden Retriever, Goldfish, Groom, Scuba Diver, and Sports Car with MeanFlow and iMF. Images are shown in ascending order of MMHM.
  • Figure 3: Comparison of generation quality across all models. Images generated for the classes Golden Retriever, Goldfish, Groom, Scuba Diver, and Sports Car. Models are selected based on the best MMHM for each family. All approaches generate their best images at 25 steps. One-step image quality for all models is shown in the supplement.
  • Figure 4: Ablations for MeanFlow across step counts of 1, 5, 10, 15, 20, 25 and CFG values of 1, 3, 6, 7, 9, 12, 15. We report separate heatmaps for FID, IS, CLIP Score, Pick Score, and MMHM.
  • Figure 5: Comparison of generation quality across all models. Images generated with classes lighthouse, church, butterfly, pizza and cat. Models are selected based on best MMHM for each family.
  • ...and 4 more figures

Theorems & Definitions (3)

  • lemma 1: Uniqueness Exclusion
  • lemma 2: Multi-label Rejection
  • lemma 3: Threshold Monotonicity