H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

Mingyu Sung, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

TL;DR

Diffusion models achieve high-fidelity image generation but suffer from the high computational cost of iterative denoising. The authors present H2-cache, a hierarchical dual-stage caching framework that splits the denoising network into a structure-defining stage and a detail-refining stage. Each stage is cached under its own threshold, and a lightweight similarity estimator called Pooled Feature Summarization (PFS) keeps the cache checks cheap while preserving image quality. On the Flux architecture, H2-cache yields speedups of up to 5.08× with near-baseline perceptual quality, outperforming existing caching methods in both speed and fidelity. The approach lowers the barrier to deploying high-fidelity diffusion models in practice, and the source code is publicly available.

Abstract

Diffusion models have emerged as the state of the art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08×) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
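
The abstract does not spell out how pooled feature summarization is computed, so the following is only a minimal sketch of the general idea: shrink a feature map by pooling, then compare the small summaries instead of the full tensors. The average pooling, the `(H, W, C)` layout, the window size, the relative L1 metric, and the names `pooled_summary` and `relative_difference` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def pooled_summary(features: np.ndarray, pool: int = 4) -> np.ndarray:
    """Average-pool an (H, W, C) feature map over non-overlapping pool x pool windows.

    Assumption: average pooling on a spatial layout; the paper's PFS may differ.
    """
    h, w, c = features.shape
    if h < pool or w < pool:
        raise ValueError("feature map is smaller than the pooling window")
    # Trim ragged edges so the map divides evenly into windows.
    h_trim, w_trim = h - h % pool, w - w % pool
    x = features[:h_trim, :w_trim]
    # Split each spatial axis into (number of windows, window size) and average.
    x = x.reshape(h_trim // pool, pool, w_trim // pool, pool, c)
    return x.mean(axis=(1, 3))


def relative_difference(current: np.ndarray, cached: np.ndarray,
                        eps: float = 1e-8) -> float:
    """Relative L1 difference between two summaries (hypothetical metric)."""
    return float(np.abs(current - cached).sum() / (np.abs(cached).sum() + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    previous = rng.normal(size=(64, 64, 16))
    current = previous + 0.01 * rng.normal(size=previous.shape)
    # The comparison runs on 16x16x16 summaries instead of 64x64x16 tensors.
    score = relative_difference(pooled_summary(current), pooled_summary(previous))
    print(f"relative difference on pooled summaries: {score:.4f}")
```

Pooling cuts the number of compared elements by a factor of `pool * pool`, and averaging damps element-wise noise, which is one plausible reading of the "robust and fast" similarity estimation the abstract describes.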

Paper Structure

This paper contains 21 sections, 16 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed H2-cache framework. The pipeline consists of a hierarchical, two-stage caching mechanism applied to the structure-defining ($\mathcal{B}_{L1}$) and detail-refining ($\mathcal{B}_{L2}$) stages. The right panel details the hit detection mechanism, which employs Pooled Feature Summarization for efficient similarity checks.
  • Figure 2: Qualitative analysis of H2-cache. Caching the global structure stage, $\mathcal{B}_{L1}$ (bottom row), freezes the overall pose and layout. In contrast, caching the detail-refining stage, $\mathcal{B}_{L2}$ (middle row), preserves fine-grained textures while allowing the global structure to evolve, demonstrating a clear functional separation compared to standard block caching (top row).
  • Figure 3: Our method (H2-cache) is compared against three baselines: no caching (Baseline), block cache, and TeaCache. For each method, we report the inference time and the reference-free CLIP-IQA ($\uparrow$) score (Wang et al., 2023).
  • Figure 4: Quantitative evaluation of performance across various image quality metrics (PSNR, SSIM, FID, and CLIP-IQA) by varying the structural threshold ($\tau_1$) and the detail threshold ($\tau_2$). The red stars indicate the best-performing hyperparameter set for each metric.
  • Figure 5: Qualitative comparison of caching behavior with and without PFS.
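
The captions above describe two stages, $\mathcal{B}_{L1}$ and $\mathcal{B}_{L2}$, gated by separate thresholds $\tau_1$ and $\tau_2$. The control flow might look roughly like the sketch below, which assumes each stage is a plain function, that a cache hit is declared when the relative L1 difference between the current and cached stage input falls below the threshold, and that the detail check runs on the output of the structure stage. The names `StageCache`, `h2_step`, and `rel_diff` are hypothetical, and the summarization step (PFS) is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

import numpy as np

Stage = Callable[[np.ndarray], np.ndarray]


def rel_diff(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Relative L1 difference (assumed metric, not taken from the paper)."""
    return float(np.abs(a - b).sum() / (np.abs(b).sum() + eps))


@dataclass
class StageCache:
    key: Optional[np.ndarray] = None    # stage input seen when the output was stored
    value: Optional[np.ndarray] = None  # stored stage output

    def lookup(self, x: np.ndarray, tau: float) -> Optional[np.ndarray]:
        """Return the stored output if x is within tau of the stored input."""
        if self.key is not None and rel_diff(x, self.key) < tau:
            return self.value
        return None

    def store(self, x: np.ndarray, y: np.ndarray) -> None:
        self.key, self.value = x, y


def h2_step(x: np.ndarray, stage_l1: Stage, stage_l2: Stage,
            cache_l1: StageCache, cache_l2: StageCache,
            tau1: float, tau2: float) -> Tuple[np.ndarray, Tuple[bool, bool]]:
    """One denoising step with two independent cache checks."""
    y1 = cache_l1.lookup(x, tau1)        # structure-stage check against tau1
    hit1 = y1 is not None
    if not hit1:
        y1 = stage_l1(x)
        cache_l1.store(x, y1)
    y2 = cache_l2.lookup(y1, tau2)       # detail-stage check against tau2
    hit2 = y2 is not None
    if not hit2:
        y2 = stage_l2(y1)
        cache_l2.store(y1, y2)
    return y2, (hit1, hit2)


if __name__ == "__main__":
    l1, l2 = StageCache(), StageCache()
    double = lambda x: 2.0 * x           # stand-in for the structure stage
    shift = lambda y: y + 1.0            # stand-in for the detail stage
    x0 = np.ones(4)
    for scale in (1.0, 1.01, 1.1, 2.0):
        _, hits = h2_step(scale * x0, double, shift, l1, l2, tau1=0.05, tau2=0.2)
        print(scale, hits)
```

Keeping the thresholds separate lets a step recompute the structure stage while still reusing the detail stage (or skip both), which is the kind of trade-off the $\tau_1$ / $\tau_2$ sweep in Figure 4 explores.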