InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models

Zihao Wu

TL;DR

InvarDiff is a training-free cross-scale caching method that exploits cross-timestep and cross-layer invariances in deterministic diffusion-model sampling to accelerate inference. It derives a binary cache plan and a step-level gate via a two-phase calibration with resampling correction, and schedules reuse step-first, then layer-wise, so that computations are reused without retraining or architectural changes. The approach yields substantial end-to-end speedups (up to around 3×) on DiT-family backbones such as FLUX and DiT-XL/2 while preserving perceptual quality, and adapts to DiT-style variants. The results indicate robust, transferable acceleration that complements existing speedup techniques and can be extended to high-resolution image and video generation pipelines.

Abstract

Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe that feature invariance exists in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits this relative temporal invariance at both the timestep scale and the layer scale. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache plan matrix and use a re-sampling correction to avoid drift when consecutive caches occur. Using quantile-based change metrics, this matrix specifies which module at which step is reused rather than recomputed. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. When applied to DiT and FLUX, our approach reduces redundant compute while preserving fidelity. Experiments show that InvarDiff achieves $2$-$3\times$ end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computation.
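
The quantile-based planning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `change` scores, the `build_cache_plan` helper, the rule "reuse when the score is at or below a global quantile", and the choice to always recompute step 0 are assumptions made for illustration, and the re-sampling correction is not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (timestep index, layer index, module name).
Key = Tuple[int, int, str]


def quantile(values: List[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def build_cache_plan(change: Dict[Key, float], q: float) -> Dict[Key, bool]:
    """Mark an entry reusable (True) when its change score is at or below
    the global q-quantile of all scores with timestep > 0.

    Assumption: step 0 is always recomputed because nothing is cached yet.
    """
    eligible = [v for (t, _, _), v in change.items() if t > 0]
    threshold = quantile(eligible, q)
    return {k: (k[0] > 0 and v <= threshold) for k, v in change.items()}


# Toy change scores for one layer over three steps (made-up numbers).
change = {
    (0, 0, "MHSA"): 0.0, (0, 0, "FFN"): 0.0,
    (1, 0, "MHSA"): 0.10, (1, 0, "FFN"): 0.80,
    (2, 0, "MHSA"): 0.20, (2, 0, "FFN"): 0.40,
}
plan = build_cache_plan(change, q=0.5)
for key in sorted(plan):
    print(key, "reuse" if plan[key] else "recompute")
```

In this toy input the two MHSA entries at steps 1 and 2 fall below the median change and are marked for reuse, while both FFN entries are recomputed. A single global threshold is used here because the figure captions argue that the per-(timestep, layer) patterns are largely class-independent.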

Paper Structure

This paper contains 37 sections, 4 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Our method achieves 3.31$\times$ speedup on FLUX.1-dev (A800, 28 steps) and 2.86$\times$ on DiT-XL/2 (RTX 4070S, 50 steps).
  • Figure 2: Temporal invariance in DiT under 50-step sampling. The horizontal axis is the inference timestep (the reverse-ordered timesteps), not the training diffusion time. For each layer $l$ and timestep $t>0$, the heatmaps show the change (log2 scale) between adjacent inference steps for $s\in\{\mathrm{MHSA},\mathrm{FFN}\}$: (a) $\mathrm{MSE}\!(Z^{(s)}_{l,t},\,Z^{(s)}_{l,t-1})$ and (b) $\cos\!\angle\!(Z^{(s)}_{l,t},\,Z^{(s)}_{l,t-1})$. The first column ($t{=}0$) is set to $0$. Values are averaged over inputs from 10 distinct class labels; the per-(timestep, layer) patterns closely match single-class maps (see Appendix), supporting a global threshold for cache planning over (timestep, layer, module).
  • Figure 3: Cross-scale caching schematic. We exploit two scales of reuse: (i) across timesteps (step-level reuse) and (ii) within a timestep across modules (layer-wise reuse of MHSA/FFN). The scheduler first tests a step-level gate; if reuse is unsafe, it traverses layers and selectively reuses or recomputes modules according to the cache plan. Dashed boxes indicate reused modules; red arcs depict cross-timestep reuse.
  • Figure 4: Average rate matrices $\rho$ (MHSA/FFN) over 10 class labels. The horizontal axis denotes inference timesteps (test-time sampling from $t$ to $t{-}1$), not the training diffusion time. The first and the last timestep are set to $1$ for visualization. Axes: Timestep (x) and Layer (y). See the Methodology section for the definition of $\rho$.
  • Figure 5: Cross-class stability of the rate matrices $\rho$. Using DiT class labels $0$–$99$ to form a reference rate matrix $R_{\text{ref}}^{(s)}$ for each module $s\!\in\!\{\mathrm{MHSA},\mathrm{FFN}\}$, we compute $\mathrm{MSE}\!\left(R_c^{(s)},R_{\text{ref}}^{(s)}\right)$ for $c\!=\!100,\ldots,999$ (DiT class labels on the $x$-axis). As shown in Fig. 4, most entries of $\rho$ lie in the $0.9$–$1.5$ band, and the curves here remain low and flat, indicating that $(\text{timestep},\text{layer})$ patterns of $\rho$ are largely class-independent. This supports using a single global quantile threshold to derive a binary cache plan.
  • ...and 11 more figures
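
The scheduling order in the Figure 3 caption (test a step-level gate first, otherwise traverse layers and reuse or recompute each module according to the cache plan) can be sketched on a toy model. Everything here is a labeled assumption rather than the paper's code: the scalar state, the residual-style modules, the `run_sampling` helper, the `step_gate` list, and the choice to cache each module's residual output and the whole-step output.

```python
from typing import Callable, Dict, List, Tuple

Module = Callable[[float], float]
Key = Tuple[int, int, str]  # hypothetical (timestep, layer, module) key


def run_sampling(
    x: float,
    layers: List[Dict[str, Module]],
    num_steps: int,
    step_gate: List[bool],
    plan: Dict[Key, bool],
    step_size: float = 0.5,
) -> Tuple[float, int]:
    """Toy deterministic sampler; returns (final state, module evaluations)."""
    module_cache: Dict[Tuple[int, str], float] = {}  # last output per (layer, module)
    last_output = 0.0  # last whole-step model output
    calls = 0
    for t in range(num_steps):
        if t > 0 and step_gate[t]:
            # Step-level reuse: skip the whole network for this step.
            output = last_output
        else:
            h = x
            for l, layer in enumerate(layers):
                for name in ("MHSA", "FFN"):
                    slot = (l, name)
                    if t > 0 and plan.get((t, l, name), False) and slot in module_cache:
                        out = module_cache[slot]  # layer-wise reuse
                    else:
                        out = layer[name](h)      # recompute and refresh the cache
                        module_cache[slot] = out
                        calls += 1
                    h = h + out                   # residual connection
            output = h - x
            last_output = output
        x = x - step_size * output                # toy sampler update
    return x, calls


# Two toy layers whose modules return constants, so reuse is exact here.
layers = [
    {"MHSA": lambda h: 0.25, "FFN": lambda h: 0.5},
    {"MHSA": lambda h: 1.0, "FFN": lambda h: 0.25},
]
baseline_x, baseline_calls = run_sampling(8.0, layers, 4, [False] * 4, {})
cached_x, cached_calls = run_sampling(
    8.0, layers, 4,
    step_gate=[False, False, True, False],
    plan={(1, 0, "MHSA"): True, (3, 1, "FFN"): True},
)
print(baseline_calls, cached_calls, baseline_x == cached_x)
```

With these toy settings the baseline performs 16 module evaluations and the cached run performs 10, with identical final states because the toy modules are constant. Real modules change between steps, which is why the plan and the gate have to come from calibration rather than being set by hand.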