Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang

TL;DR

The paper tackles slow inference in diffusion transformers by introducing Learning-to-Cache (L2C), a differentiable router that learns which transformer layers to cache across timesteps without updating the model parameters. The router β interpolates between recomputing a layer and reusing its result from the previous step, which turns the combinatorial choice of layers to cache into a differentiable optimization under a budget; the optimized router then yields a static computation graph for fast inference. Empirically, L2C keeps image quality close to the original while removing up to 93.68% of the computation in the cache steps of U-ViT-H/2 (46.84% across all steps) with less than a 0.01 drop in FID, and it outperforms DDIM, DPM-Solver, and prior cache-based methods at the same inference speed on DiT and U-ViT models. The learned routers also reveal architecture-dependent caching patterns and show how layer importance varies across timesteps.
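The interpolation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `layer_fn`, `routed_layer`, `static_layer`, the scalar `scale`, and the `threshold` value of 0.5 are all hypothetical names and choices made for illustration, and the blend formula is an assumption based on the description of β as a smooth transition between computing a layer and reusing the previous step's result.

```python
import math


def layer_fn(x, scale):
    # Hypothetical stand-in for the expensive non-residual path of a layer
    # (attention or MLP); a real model would run the transformer block here.
    return [math.tanh(scale * v) for v in x]


def routed_layer(x, scale, cached, beta):
    """Training-time view: blend fresh and cached outputs with beta in [0, 1]."""
    fresh = layer_fn(x, scale)
    out = [xi + beta * fi + (1.0 - beta) * ci
           for xi, fi, ci in zip(x, fresh, cached)]
    return out, fresh


def static_layer(x, scale, cached, beta, threshold=0.5):
    """Inference-time view: one hard decision per (timestep, layer) pair."""
    if beta < threshold:  # threshold is an assumed value, not from the paper
        # Layer disabled: skip layer_fn and reuse the previous step's output.
        return [xi + ci for xi, ci in zip(x, cached)], cached
    fresh = layer_fn(x, scale)
    return [xi + fi for xi, fi in zip(x, fresh)], fresh


x = [0.5, -1.0]
cached = [0.1, 0.2]  # pretend this was stored at the previous timestep
blended, _ = routed_layer(x, 1.0, cached, beta=0.3)
skipped, reused = static_layer(x, 1.0, cached, beta=0.3)
```

Because β depends only on the timestep and the layer, not on the input, the hard decisions in `static_layer` can be fixed once after optimization, which is what makes the resulting computation graph static.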

Abstract

Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come at the cost of slow inference, since each denoising step requires inference on a transformer model with a large number of parameters. In this study, we make an interesting and somewhat surprising observation: by introducing a caching mechanism, the computation of a large proportion of layers in the diffusion transformer can be readily removed, even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed. Code is available at https://github.com/horseee/learning-to-cache
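The two percentages quoted for U-ViT-H/2 are consistent with each other if half of the denoising steps are cache steps. That share is an assumption inferred from the 2:1 ratio of the two figures, not something the abstract states, and the function name below is hypothetical. A minimal arithmetic sketch:

```python
def overall_removed_fraction(cache_step_share, removed_in_cache_steps):
    # Computation is only removed in cache steps, so the overall saving is
    # the per-cache-step saving weighted by the share of cache steps.
    return cache_step_share * removed_in_cache_steps


# Assumed: every other step is a cache step (share = 0.5).
overall = overall_removed_fraction(0.5, 0.9368)
print(f"{overall:.2%}")  # 46.84%
```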

Paper Structure (34 sections, 13 equations, 8 figures, 9 tables, 2 algorithms)

This paper contains 34 sections, 13 equations, 8 figures, 9 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) Generate 512$\times$512 images using DiT-XL/2, sampled by DDIM with 50 NFEs. (b) Generate 256$\times$256 images using U-ViT-H/2, sampled by DPM-Solver-2 with 50 NFEs.
  • Figure 2: Illustration of Learning-to-Cache. When a layer is activated, the calculation proceeds as usual. In contrast, when a layer is disabled, the computation of the non-residual path is bypassed, and the results from the previous step are utilized instead. The router $\boldsymbol{\beta}$ smoothly controls the transition between two endpoints $\boldsymbol{\epsilon}_\theta(\boldsymbol{x}_s, s)$ and $\boldsymbol{\epsilon}_\theta(\boldsymbol{x}_m, m)$.
  • Figure 3: Approximation error for DiT and U-ViT at different timesteps and in different layers.
  • Figure 4: Speed-Quality Tradeoff for DiT-XL/2 and U-ViT-H/2 with 20 denoising steps as the basis. The dashed line indicates the performance without applying inference acceleration.
  • Figure 5: Learned Router $\boldsymbol{\beta}$ for DiT-XL/2 (Top) and U-ViT-H/2 (Bottom). Different caching patterns are observed in different types of diffusion transformers.
  • ...and 3 more figures