STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence

Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu

TL;DR

This work targets slow sampling in diffusion and flow-matching models by handling stiffness and structure-dependence in a single training-free solver. The authors introduce Stabilized Taylor Orthogonal Runge--Kutta (STORK), a stiff-ODE solver derived from stabilized Runge--Kutta (SRK) methods that does not rely on the semi-linear structure of the diffusion ODE. STORK replaces some function evaluations with Taylor-expanded "virtual NFEs", which keeps high-order accuracy while reducing the number of actual NFEs, and it applies to both noise-prediction and flow-matching ODEs. Across image and video generation, STORK outperforms state-of-the-art training-free samplers (e.g., DPM-Solver++, UniPC) on unconditional and conditional tasks at low NFEs. Because it requires no additional training or model modification, it can be applied directly to existing large diffusion and flow-matching models to reduce sampling cost.

Abstract

Diffusion models (DMs) and flow-matching models have demonstrated remarkable performance in image and video generation. However, such models require a significant number of function evaluations (NFEs) during sampling, leading to costly inference. Consequently, quality-preserving fast sampling methods that require fewer NFEs have been an active area of research. Prior training-free sampling methods, however, fail to simultaneously address two key challenges: the stiffness of the ODE (i.e., the non-straightness of the velocity field) and dependence on the semi-linear structure of the DM ODE (which limits their direct applicability to flow-matching models). In this work, we introduce the Stabilized Taylor Orthogonal Runge--Kutta (STORK) method, which addresses both challenges. We demonstrate that STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation. Code is available at https://github.com/ZT220501/STORK.
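
To make the stiffness issue concrete, the following minimal sketch compares explicit Euler with a generic first-order Chebyshev stabilized Runge--Kutta step on a stiff linear test problem. It is not the STORK algorithm: the test equation, the step size, the stage count `s`, and all function names are assumptions chosen for illustration, and the scheme shown is the textbook undamped Chebyshev method whose stability polynomial is $T_s(1 + z/s^2)$.

```python
def euler_step(f, y, h):
    """One explicit Euler step: stable on y' = -lam*y only if h*lam <= 2."""
    return y + h * f(y)


def chebyshev_srk_step(f, y, h, s):
    """One step of the first-order, undamped Chebyshev stabilized RK scheme.

    The s stages follow the Chebyshev three-term recurrence, so the step's
    stability polynomial is T_s(1 + z/s^2) and the real stability interval
    grows like 2*s^2 instead of Euler's 2.
    """
    g_prev = y
    g = y + (h / s**2) * f(y)          # stage 1
    for _ in range(2, s + 1):          # stages 2..s
        g_prev, g = g, 2.0 * g - g_prev + (2.0 * h / s**2) * f(g)
    return g


def integrate(step, f, y0, h, n_steps):
    """Apply `step` repeatedly with a fixed step size."""
    y = y0
    for _ in range(n_steps):
        y = step(f, y, h)
    return y


# Stiff linear test problem y' = -lam * y with h * lam = 8 (assumed values).
lam, h, n_steps = 40.0, 0.2, 20
f = lambda y: -lam * y

y_euler = integrate(euler_step, f, 1.0, h, n_steps)
y_srk = integrate(lambda f_, y, h_: chebyshev_srk_step(f_, y, h_, s=4),
                  f, 1.0, h, n_steps)

# Euler's per-step factor is 1 - 8 = -7, so the iterate blows up.
# The 4-stage factor is T_4(1 - 8/16) = T_4(0.5) = -0.5, so it decays.
print(f"Euler: {y_euler:.3e}   Chebyshev SRK (s=4): {y_srk:.3e}")
```

Each stabilized step here spends `s` real function evaluations, which would be expensive when every evaluation is a network forward pass. That cost is what the virtual NFEs mentioned in the TL;DR are meant to reduce.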

Paper Structure

This paper contains 41 sections, 2 theorems, 28 equations, 15 figures, 14 tables, 2 algorithms.

Key Result

Theorem 1

Assume $\bm{\epsilon}_\theta(\bm{x}_t, t)$ or $\bm{v}(\bm{x}_t, t)$ satisfies the assumptions in the Appendix. Let $\{\tilde{\bm{x}}_{t_i}\}_{i=0}^M$ be the sequence computed by STORK-$k$ with timesteps $\{t_i\}_{i=0}^M$. For $k=2, 4$, if Taylor expansion is not used for the virtual NFEs, the STORK-$k$

Figures (15)

  • Figure 1: Comparison of Flow-Euler, Flow-DPM-Solver++ [DPM-solver++, Sana], Flow-UniPC [UniPC], and STORK. All images are generated using the SANA 1.6B model [Sana] at $1024\times 1024$ resolution with only 8 NFEs. Prompts are displayed beneath each image pair, accompanied by our commentary explaining why STORK's generations are superior. STORK achieves much better visual fidelity in the extremely low-NFE regime, showing its effectiveness as a fast sampling method. Zoom in for better visual details.
  • Figure 2: Video generation with the Hunyuan model [kong2024hunyuanvideo] using the prompt: "Iron Man is walking towards the camera in the rain at night, with a lot of fog behind him. Science fiction movie, close-up". Our video portrays Iron Man more clearly and has rain in the background.
  • Figure 3: Illustration of function evaluations for STORK-4 with $s=4$ and first-order Taylor approximation. For clarity of presentation, we use uniform timesteps of size $h$. "NFE" denotes actual NFEs, while "virtual NFE" denotes NFEs approximated with the Taylor expansion. The arrows indicate that the previously computed velocity is used to approximate the first derivative. Euler's method is used for the first step since there is no previous velocity.
  • Figure 4: Derived from the SRK method, STORK is both a stiff problem solver and a structure-independent solver.
  • Figure 5: Sample quality measured by FID $\downarrow$ for unconditional (Uncond) and classifier-free-guided (CFG) generation with noise-prediction models. As shown, STORK consistently outperforms other methods across datasets and image scales.
  • ...and 10 more figures
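
The virtual NFE idea in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' implementation: the function name `virtual_velocity`, its arguments, and the choice to treat the velocity as a function of time along the trajectory are all assumptions. The derivative is estimated by a finite difference between the current and the previously computed velocity, and the velocity at an intermediate time is then extrapolated with a first-order Taylor expansion instead of a new network call.

```python
import numpy as np


def virtual_velocity(v_curr, v_prev, t_curr, t_prev, t_query):
    """First-order Taylor estimate of the velocity at time t_query.

    v_curr, v_prev: velocities from real evaluations at t_curr and t_prev
                    (hypothetical inputs; arrays of the same shape).
    The time derivative is a finite difference of the two real evaluations.
    """
    dv_dt = (v_curr - v_prev) / (t_curr - t_prev)
    return v_curr + (t_query - t_curr) * dv_dt


# Toy check with a velocity that is linear in time, v(t) = a + b * t,
# for which first-order extrapolation is exact up to rounding.
a = np.array([2.0, -1.0])
b = np.array([3.0, 0.5])
velocity = lambda t: a + b * t

t_prev, t_curr, t_query = 1.0, 0.9, 0.85
v_virtual = virtual_velocity(velocity(t_curr), velocity(t_prev),
                             t_curr, t_prev, t_query)
print(np.allclose(v_virtual, velocity(t_query)))
```

At the very first step no previous velocity exists, so the finite difference is unavailable; the caption states that Euler's method is used there.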

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • Proof of Theorem \ref{thm:STORK_convergence}