ODE$_t$(ODE$_l$): Shortcutting the Time and the Length in Diffusion and Flow Models for Faster Sampling

Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer

TL;DR

The paper tackles the high sampling cost of continuous normalizing flows and diffusion models by introducing ODE$_t$(ODE$_l$), which treats the inner network as a discretized ODE over depth (length $l$) while keeping the outer time ODE solver-agnostic. A length-consistency training objective, together with architectural rewiring (residuals and length-embedded blocks), enables dynamic depth during sampling without substantial overhead. Empirical results on CelebA-HQ-256 and ImageNet-256 show a latency reduction of up to $2\times$ and FID improvements of up to $2.8$ points in high-quality regimes, with adaptive-step solvers further boosting performance. The approach is complementary to existing methods that minimize the number of function evaluations (NFE), and the code is openly released at github.com/gudovskiy/odelt to encourage broad adoption and further development.

Abstract

Continuous normalizing flows (CNFs) and diffusion models (DMs) generate high-quality data from a noise distribution. However, their sampling process demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. State-of-the-art methods focus on reducing the number of discrete time steps during sampling to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can also be controlled in terms of the neural network length. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its depth. Then, we apply a length consistency term during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our ODE$_t$(ODE$_l$) approach is solver-agnostic in time dimension and reduces both latency and, importantly, memory usage. CelebA-HQ and ImageNet generation experiments show a latency reduction of up to $2\times$ in the most efficient sampling mode, and FID improvement of up to $2.8$ points for high-quality sampling when applied to prior methods. We open-source our code and checkpoints at github.com/gudovskiy/odelt.
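The nested structure described in the abstract can be illustrated with a toy sketch: an outer fixed-step Euler solver over time calls a velocity function that is itself computed by residual, Euler-like updates over a truncated stack of blocks. Everything here is an assumption made for illustration (the scalar `tanh` blocks, the names `make_blocks`, `velocity`, `sample`, and `block_evaluations`, the even split of the depth interval, and the velocity readout); it is not the authors' transformer architecture or training procedure.

```python
import math
import random


def make_blocks(num_blocks, seed=0):
    """Hypothetical toy blocks: one (weight, bias) pair per scalar tanh layer."""
    rng = random.Random(seed)
    return [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(num_blocks)]


def velocity(blocks, l, t, x):
    """Inner ODE over depth: apply only the first l blocks as residual updates."""
    if not 1 <= l <= len(blocks):
        raise ValueError("l must be between 1 and the number of blocks")
    h = x
    dl = 1.0 / l  # assumption: the active blocks evenly split a unit depth interval
    for w, b in blocks[:l]:
        h = h + dl * math.tanh(w * h + b + t)  # one Euler-like step in depth
    return h - x  # assumption: velocity is read out as the net hidden-state change


def sample(blocks, l, num_steps, x0):
    """Outer ODE over time: fixed-step Euler from t = 0 to t = 1."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(blocks, l, i * dt, x)
    return x


def block_evaluations(l, num_steps):
    """Cost proxy: every time step runs l blocks."""
    return l * num_steps


blocks = make_blocks(num_blocks=12)
full = sample(blocks, l=12, num_steps=8, x0=0.5)   # all blocks active
short = sample(blocks, l=6, num_steps=8, x0=0.5)   # length shortcut: half the blocks
print(full, short, block_evaluations(12, 8), block_evaluations(6, 8))
```

In this toy setting, halving `l` halves the number of block evaluations for the same number of time steps, which is the kind of length-wise tradeoff the abstract describes alongside the usual time-step tradeoff. The untrained toy blocks say nothing about sample quality; in the paper that role is played by the length consistency term during training.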

Paper Structure

This paper contains 12 sections, 8 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: The conventional approach (top) models the interpolated vector field $u_t ({\bm{x}} | {\bm{z}})$ by an expressive but monolithic neural network $v_{\bm{\theta}} (t, {\bm{x}}_t)$. This limits practitioners to adjusting the quality-complexity tradeoff only in the integral's time dimension. Our ODE$_t$(ODE$_l$) approach (bottom) treats the neural network as the inner $\textrm{ODE}_l$ and allows selecting the number of active blocks, which reduces latency and memory usage during sampling.
  • Figure 2: Visualization of ODE$_t$(ODE$_l$) when it is applied to the time shortcuts of the shortcut model [shortcut]. Our approach models the target vector field using the configurable neural network $v_{\bm{\theta}}(l, d, t, {\bm{x}}_t)$. The length hyperparameter $l$ defines the number of active blocks within the architecture, i.e., the length shortcuts. The time shortcuts are adjusted by the hyperparameter $d$, as proposed by [shortcut]. Several cases can then be highlighted: a) $v_{\bm{\theta}}(L, 0, t, {\bm{x}}_t)$ is equivalent to the conventional CFM/DM processing, b) $v_{\bm{\theta}}(L, d, t, {\bm{x}}_t)$ is identical to the time-only shortcuts (e.g., $d=1$ for single-step sampling), c) $v_{\bm{\theta}}(l, 0, t, {\bm{x}}_t)$ with only length shortcuts supports any ODE solver, including more advanced ones with adaptive steps [dormand1980family, zheng2023dpmsolverv, frankel2025ss], and d) $v_{\bm{\theta}}(l, d, t, {\bm{x}}_t)$ is the general setup with both length and time shortcuts.
  • Figure 3: DiT-B CelebA sampling. Time shortcuts (SM) [shortcut] (first row) alter the image style, whereas our length shortcuts (columns) preserve the style and iteratively compress the image details.
  • Figure 4: SiT-XL ImageNet sampling. Compared to MF [mf] (first row), our ODE$_t$(ODE$_{l=20}$) in the last row uses 30% less compute and contains moderately fewer semantic details when NFE$=1$.
  • Figure 5: CelebA FID vs. DiT-B latency. ODE$_t$(ODE$_l$) scales better than SM [shortcut]: it shows up to a $2\times$ reduction in latency with the Euler solver in compute-optimized mode and provides an FID score up to $2.8$ points lower with an adaptive-step solver for high-quality sampling.
  • ...and 2 more figures
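
The four configurations listed in the Figure 2 caption can be summarized as a small lookup over the pair $(l, d)$. The sketch below only encodes that case analysis; the function name `shortcut_case`, the returned labels, and the assumed valid range $0 \le d \le 1$ are illustrative choices that do not come from the paper or its code.

```python
def shortcut_case(l, d, L):
    """Map (l, d) to the case letter used in the Figure 2 caption.

    l: number of active blocks, L: total number of blocks,
    d: time-shortcut hyperparameter (assumed here to lie in [0, 1]).
    """
    if not 1 <= l <= L:
        raise ValueError("l must satisfy 1 <= l <= L")
    if not 0.0 <= d <= 1.0:
        raise ValueError("d is assumed to lie in [0, 1]")

    full_length = l == L     # every block is active
    time_shortcut = d > 0    # d = 0 means no time shortcut

    if full_length and not time_shortcut:
        return "a"  # conventional CFM/DM processing
    if full_length and time_shortcut:
        return "b"  # time-only shortcuts (d = 1 is single-step sampling)
    if not time_shortcut:
        return "c"  # length-only shortcuts, usable with any ODE solver
    return "d"      # general setup: length and time shortcuts together


for l, d in [(12, 0.0), (12, 1.0), (6, 0.0), (6, 0.5)]:
    print(l, d, shortcut_case(l, d, L=12))
```

Case c) is the one the caption singles out as compatible with adaptive-step solvers, since no time shortcut is baked into the network input.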