Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

Shikang Zheng, Guantao Chen, Qinming Zhou, Yuqi Lin, Lixuan He, Chang Zou, Peiliang Cai, Jiacheng Liu, Linfeng Zhang

TL;DR

Diffusion Transformers are powerful but bottlenecked by per-timestep forward passes. HyCa reframes feature caching as a hybrid ODE solving problem by clustering feature dimensions according to their dynamic behavior and assigning specialized solvers per cluster, enabling offline solver selection and online per-cluster solving without retraining. Across text-to-image, text-to-video, and image editing, HyCa delivers substantial speedups (up to ≈6×) with negligible loss in quality and remains compatible with distillation. This provides a practical, training-free pathway to deploy diffusion transformers in latency-constrained settings while preserving high fidelity.

Abstract

Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck because every timestep requires a costly transformer forward pass. Feature caching has emerged as a training-free acceleration technique that mitigates this cost by reusing or forecasting hidden representations. However, existing methods often apply a single caching strategy uniformly across all feature dimensions, ignoring their heterogeneous dynamic behaviors. We therefore model hidden feature evolution as a mixture of ODEs across dimensions and introduce HyCa, a caching framework inspired by hybrid ODE solvers that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including a 5.55× speedup on FLUX, a 5.56× speedup on HunyuanVideo, and a 6.24× speedup on Qwen-Image and Qwen-Image-Edit, all without retraining.
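
To make the idea of dimension-wise caching concrete, the following is a minimal sketch of what a forecasting step could look like once each feature dimension has been mapped to a cluster and each cluster to a solver. It is an illustration under stated assumptions, not the authors' implementation: the three candidate solvers (`reuse`, `linear`, `quadratic`), the function `forecast`, and the cache layout are hypothetical names chosen here, and cached timesteps are assumed to be equally spaced.

```python
from typing import Callable, Dict, List, Sequence

History = Sequence[float]  # cached values of ONE feature dimension, oldest first


def reuse(history: History) -> float:
    """Zeroth-order hold: repeat the most recent cached value."""
    return history[-1]


def linear(history: History) -> float:
    """First-order extrapolation from the last two cached values."""
    return 2 * history[-1] - history[-2]


def quadratic(history: History) -> float:
    """Second-order extrapolation from the last three cached values."""
    return 3 * history[-1] - 3 * history[-2] + history[-3]


# Hypothetical solver pool; the paper's actual candidates may differ.
SOLVERS: Dict[str, Callable[[History], float]] = {
    "reuse": reuse,
    "linear": linear,
    "quadratic": quadratic,
}


def forecast(
    cache: List[List[float]],
    cluster_of_dim: Sequence[int],
    solver_of_cluster: Dict[int, str],
) -> List[float]:
    """Predict the next feature vector without a forward pass.

    cache[t][d] is the value of dimension d at cached step t (oldest first).
    Each dimension is forecast by the solver assigned to its cluster.
    """
    n_dims = len(cache[0])
    prediction = []
    for d in range(n_dims):
        history = [step[d] for step in cache]
        solver = SOLVERS[solver_of_cluster[cluster_of_dim[d]]]
        prediction.append(solver(history))
    return prediction


# Two dimensions: one drifting steadily, one constant.
cache = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
next_features = forecast(cache, cluster_of_dim=[0, 1],
                         solver_of_cluster={0: "linear", 1: "reuse"})
```

In this toy setup the steadily drifting dimension is extrapolated linearly while the constant one is simply reused, which is the kind of per-cluster specialization the abstract describes.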

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Images generated on Qwen-Image with HyCa at 6.24× acceleration.
  • Figure 2: Feature trajectory clusters and stability of assignments. (a–b) Cluster 1 shows oscillatory trajectories, while Cluster 2 shows smooth ones. (c–d) Adjusted Rand Index (ARI) distributions on HunyuanVideo and Qwen-Image exceed 0.8 in most cases, confirming that cluster assignments are stable and consistent across prompts and timesteps. An ARI above 0.8 indicates strong agreement and high clustering reliability.
  • Figure 3: HyCa Framework. (a) Offline Preprocessing: feature dimensions are first analyzed and clustered using temporal indicators (e.g., differences, curvature). For each cluster, candidate solvers generate predicted features, which are compared against the actually computed features; the solver with the minimum error is assigned to that cluster. (b) Inference: once assigned, each cluster consistently reuses its solver, enabling efficient prediction by skipping redundant computations while maintaining accuracy.
  • Figure 4: Visual comparison of 5.5× accelerated FLUX.
  • Figure 5: Visual comparison of different caching methods on Qwen-Image-Edit.
  • ...and 2 more figures
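
The offline stage described in the Figure 3 caption (cluster dimensions by temporal indicators, then give each cluster the candidate solver with the lowest error) can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it assumes a single indicator (mean absolute second difference as a stand-in for curvature), a tiny two-cluster 1-D k-means, a hypothetical three-solver pool, and mean squared one-step error as the selection criterion. All function names are made up for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Trajectory = Sequence[float]  # one feature dimension across timesteps

# Hypothetical candidate solvers (assume equally spaced timesteps).
SOLVERS: Dict[str, Callable[[Trajectory], float]] = {
    "reuse": lambda h: h[-1],
    "linear": lambda h: 2 * h[-1] - h[-2],
    "quadratic": lambda h: 3 * h[-1] - 3 * h[-2] + h[-3],
}


def curvature(traj: Trajectory) -> float:
    """Mean absolute second difference; large for oscillatory dimensions."""
    second = [traj[i + 1] - 2 * traj[i] + traj[i - 1]
              for i in range(1, len(traj) - 1)]
    return sum(abs(s) for s in second) / len(second)


def two_means(values: Sequence[float], iters: int = 20) -> List[int]:
    """Tiny 1-D k-means with k=2; returns a cluster id (0 or 1) per value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: abs(v - centers[k])) for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def solver_error(traj: Trajectory, solver: Callable[[Trajectory], float],
                 warmup: int = 3) -> float:
    """Mean squared one-step prediction error along a recorded trajectory."""
    errs = [(solver(traj[:t]) - traj[t]) ** 2 for t in range(warmup, len(traj))]
    return sum(errs) / len(errs)


def select_solvers(trajs: List[Trajectory]) -> Tuple[List[int], Dict[int, str]]:
    """Cluster dimensions, then assign each cluster its minimum-error solver."""
    labels = two_means([curvature(t) for t in trajs])
    assignment: Dict[int, str] = {}
    for k in set(labels):
        members = [t for t, lab in zip(trajs, labels) if lab == k]
        assignment[k] = min(
            SOLVERS,
            key=lambda name: sum(solver_error(t, SOLVERS[name]) for t in members),
        )
    return labels, assignment


# Two smooth dimensions and one oscillatory dimension.
trajs = [
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
]
labels, assignment = select_solvers(trajs)
```

On this toy data the smooth dimensions land in one cluster and are served by an extrapolating solver, while the oscillatory dimension is assigned plain reuse, because higher-order extrapolation amplifies the oscillation. The resulting `labels` and `assignment` are the only state an inference-time forecaster would need to keep.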