Relational Feature Caching for Accelerating Diffusion Transformers

Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, Bumsub Ham

TL;DR

Relational feature caching (RFC) is a framework that leverages the input-output relationship of a module to predict cached features more accurately. Its relational feature estimation (RFE) component estimates the magnitude of changes in the output features from the inputs.

Abstract

Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps and reusing them in subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently and significantly outperforms prior approaches. The project page is available at https://cvlab.yonsei.ac.kr/projects/RFC
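
The contrast between temporal extrapolation and an input-driven estimate can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the first-order extrapolation rule, and the specific way the output change is rescaled by the input change are all assumptions made for illustration, and the "module" is a toy linear map whose input moves in a fixed direction with an irregular step size.

```python
import numpy as np

def temporal_forecast(o_prev, o_curr, n, k):
    """First-order temporal extrapolation (illustrative).
    o_prev and o_curr are outputs cached at two full computations n steps apart;
    the prediction is made k steps after the latest one."""
    return o_curr + (k / n) * (o_curr - o_prev)

def relational_forecast(i_prev, i_curr, o_prev, o_curr, i_now, eps=1e-12):
    """Hypothetical relational estimate: rescale the cached output change by
    the ratio of the current input change to the cached input change.
    Assumes the input keeps moving in the same direction as before."""
    ratio = np.linalg.norm(i_now - i_curr) / (np.linalg.norm(i_curr - i_prev) + eps)
    return o_curr + ratio * (o_curr - o_prev)

# Toy setup (assumption): a linear module and inputs along one direction d.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
module = lambda x: W @ x
x0 = rng.normal(size=8)
d = rng.normal(size=8)

i_prev, i_curr = x0, x0 + d      # inputs at the two full computations
i_now = i_curr + 0.3 * d         # irregular step: 0.3 instead of the uniform 1/4
o_prev, o_curr, o_now = module(i_prev), module(i_curr), module(i_now)

pred_temporal = temporal_forecast(o_prev, o_curr, n=4, k=1)
pred_relational = relational_forecast(i_prev, i_curr, o_prev, o_curr, i_now)

err_temporal = np.linalg.norm(pred_temporal - o_now)
err_relational = np.linalg.norm(pred_relational - o_now)
print(err_temporal, err_relational)
```

In this toy setting the temporal forecast assumes a uniform step and misses the irregular one, while the input-driven estimate recovers it because the input change is observed directly. Real DiT modules are nonlinear, so the gap would not generally be this clean.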

Paper Structure (39 sections, 1 theorem, 20 equations, 12 figures, 13 tables)

This paper contains 39 sections, 1 theorem, 20 equations, 12 figures, and 13 tables.

Key Result

Proposition 1

Assume that the mapping from input to output features is locally linear, and the direction of the difference vector $\Delta_k I(t-k)$ remains constant for $1 \le k \le N$, where $N$ is an interval between full computations. Then, the ratio $s_k(t-k)$ is approximately invariant w.r.t. $k$.
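
One way to see why the proposition is plausible is the short derivation below. It is only a sketch under stated assumptions: it takes $s_k(t-k)$ to be the ratio of the output-change norm to the input-change norm, and uses the $L_2$ norm, neither of which is defined in this excerpt. The symbols $J$ (the local linear map), $a_k$ (the step magnitude), and $u$ (the unit direction) are introduced here for illustration only.

```latex
\begin{align*}
\Delta_k O(t-k) &\approx J\,\Delta_k I(t-k)
  && \text{(local linearity)} \\
\Delta_k I(t-k) &= a_k\,u, \quad \|u\|_2 = 1,\ a_k > 0
  && \text{(constant direction)} \\
s_k(t-k) &= \frac{\|\Delta_k O(t-k)\|_2}{\|\Delta_k I(t-k)\|_2}
  \approx \frac{a_k\,\|J u\|_2}{a_k} = \|J u\|_2
\end{align*}
```

The magnitude $a_k$ cancels, so under these assumptions the ratio depends only on $J$ and $u$, not on $k$.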

Figures (12)

  • Figure 1: Feature analysis and comparison between existing approaches (FORA [selvaraju2024fora], TaylorSeer [liu2025reusing]) and our method (RFC) using DiT-XL/2 [peebles2023scalable]. (a-b) Min-max normalized $L_2$ distances of output and input features, measured between consecutive timesteps. While the variations of feature changes are irregular, those of input and output remain closely aligned with each other. (c) The prediction errors across different modules. We measure the relative $L_1$ error between output features with and without applying caching methods and average the values over the timesteps. (d) Quantitative results on ImageNet [deng2009imagenet] evaluated in terms of FLOPs and sFID [nash2021generating].
  • Figure 2: Empirical analyses in DiT-XL/2 [peebles2023scalable]. (a) RSD of $s_k(t-k)$ with varying $t$. (b) Relative $L_1$ errors of the output and input features, i.e., $\mathcal{E}_O(t)$ and $\mathcal{E}_I(t)$, respectively. Please see the text for details.
  • Figure 3: Qualitative comparisons of (left) class-conditional image generation for DiT-XL/2 [peebles2023scalable] on ImageNet [deng2009imagenet], and (right) text-to-image generation for FLUX.1 dev [labs2025flux1kontextflowmatching] on DrawBench [saharia2022photorealistic].
  • Figure 4: Qualitative comparisons of text-to-video generation for HunyuanVideo [kong2024hunyuanvideo] on VBench [huang2024vbench]. Please see the supplementary material for the actual video clips.
  • Figure 5: Feature analyses using DiT-XL/2 [peebles2023scalable] on ImageNet [deng2009imagenet]. (a) A linearity analysis of diffusion features. The solid lines present the linearity between the input and output features, while the dashed lines show the linearity between the timesteps and output features. (b-c) The directional consistency of the input and output features.
  • ...and 7 more figures
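
The quantities named in the captions of Figures 1 and 2 can be sketched with a few NumPy helpers. The exact definitions used in the paper are not given in this excerpt, so the formulas below are assumptions: the relative $L_1$ error is taken as the $L_1$ norm of the difference divided by the $L_1$ norm of the reference, and RSD is read as the relative standard deviation (standard deviation divided by mean). All function names are hypothetical.

```python
import numpy as np

def consecutive_l2(features):
    """L2 distance between features at consecutive timesteps.
    `features` has shape (T, D): one D-dimensional feature per timestep."""
    return np.linalg.norm(np.diff(features, axis=0), axis=1)

def min_max_normalize(x):
    """Rescale values to [0, 1]; returns zeros if all values are equal."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def relative_l1_error(approx, reference):
    """Assumed definition: ||approx - reference||_1 / ||reference||_1."""
    approx = np.asarray(approx, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.abs(approx - reference).sum() / np.abs(reference).sum()

def rsd(x):
    """Assumed definition: relative standard deviation, std / mean."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

# Tiny hand-checkable example: four timesteps of a 2-D feature.
feats = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [9.0, 12.0]])
dists = consecutive_l2(feats)            # distances 5, 0, 10
normalized = min_max_normalize(dists)    # 0.5, 0.0, 1.0
```

Applying `consecutive_l2` followed by `min_max_normalize` to the input and output features of a module would give curves comparable in spirit to panels (a-b) of Figure 1, under the assumed definitions.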

Theorems & Definitions (2)

  • Proposition 1
  • proof