Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation

Jianzhi Long, Wenhao Sun, Rongcheng Tu, Dacheng Tao

TL;DR

This work tackles the inefficiency of diffusion-based talking head generation by introducing a training-free acceleration framework that exploits task-specific redundancies. The core ideas are Lightning-fast Caching-based Parallel denoising Prediction (LightningCP), which caches high-level decoder features to enable parallel, reduced-pass denoising, and Decoupled Foreground Attention (DFA), which localizes attention to the dynamic foreground while reusing stable background features. The method achieves substantial speedups (up to around 3.15× on some models) with minimal or no loss in video quality, validated across multiple models and datasets, and complemented by input latent estimation and optional reference feature removal. These contributions offer practical, plug-in improvements for real-time or near-real-time diffusion-based talking head generation in realistic settings.
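
The caching idea summarized above can be illustrated with a toy denoising loop. This is a minimal sketch, not the authors' implementation: `deep_layers`, `last_layer`, `denoise`, the scalar latent, the fixed `key_interval` schedule, and the update rule are all hypothetical stand-ins chosen for illustration. It shows only the feature-reuse pattern (full pass at key timesteps, shallow pass elsewhere) and does not model the parallel prediction or the input latent estimation.

```python
from typing import List, Tuple


def deep_layers(x: float, t: int) -> float:
    """Hypothetical stand-in for the expensive part of the network
    (encoder, midblock, and most of the decoder)."""
    return 0.5 * x + 0.01 * t


def last_layer(x: float, feat: float, t: int) -> float:
    """Hypothetical stand-in for the final decoder layer, which consumes
    the (possibly cached) high-level feature and predicts noise."""
    return 0.1 * x + feat


def denoise(x: float, timesteps: List[int], key_interval: int) -> Tuple[float, int]:
    """Run a toy denoising loop, recomputing the deep feature only at key steps.

    Returns the final latent and the number of full (expensive) passes.
    """
    cache = None
    full_passes = 0
    for i, t in enumerate(timesteps):
        if i % key_interval == 0:
            # Key timestep: run the whole model and refresh the cache.
            cache = deep_layers(x, t)
            full_passes += 1
        # Every timestep: only the cheap last layer runs on the cached feature.
        eps = last_layer(x, cache, t)
        x = x - 0.1 * eps  # toy update rule, assumed for illustration
    return x, full_passes


if __name__ == "__main__":
    steps = list(range(9, -1, -1))
    _, n_full = denoise(1.0, steps, key_interval=3)
    print(f"full passes with caching: {n_full} of {len(steps)}")
```

With ten timesteps and `key_interval=3`, the expensive layers run at only four of the ten steps; the remaining steps reuse the cached feature.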

Abstract

Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), which caches static features to bypass most model layers at inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers for additional speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
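
To make the foreground-restricted attention concrete, here is a small sketch under stated assumptions. It is not the paper's DFA implementation: the function names, the single-head self-attention, the boolean `fg_mask`, and the choice to restrict both queries and keys to foreground tokens while copying background outputs from a cached result are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Plain single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def foreground_attention(tokens: List[Vector], fg_mask: List[bool],
                         cached_output: List[Vector]) -> List[Vector]:
    """Recompute attention for foreground tokens only (assumed scheme).

    Background positions keep the output cached from an earlier full pass.
    """
    out = [list(row) for row in cached_output]  # start from the cached result
    fg_idx = [i for i, is_fg in enumerate(fg_mask) if is_fg]
    if not fg_idx:
        return out  # nothing to recompute
    fg_tokens = [tokens[i] for i in fg_idx]
    fg_out = attention(fg_tokens, fg_tokens, fg_tokens)
    for i, row in zip(fg_idx, fg_out):
        out[i] = row  # overwrite only the foreground positions
    return out
```

Because attention cost grows with the product of query and key counts, attending over only the foreground subset shrinks the work roughly quadratically in the fraction of tokens kept.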

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Analysis of the feature $f_{U_{31}}$ across timesteps in the Hallo model: (a) The $L_2$ Distance of the feature $f_{U_{31}}$ between consecutive timesteps, (b) The cosine similarity matrix of the feature $f_{U_{31}}$ between all timesteps.
  • Figure 2: Average attention scores for foreground noisy latent tokens in the reference attention module ($U_{32}$) of Hallo, showing attention correlations to (a) FG and BG noisy latent features and (b) FG and BG reference features. The $L_2$ distance of the BG attention output features in the upsampling layer $U_{32}$ between consecutive timesteps: (c) reference attention, (d) audio attention, and (e) temporal attention. FG: foreground. BG: background. std: standard deviation.
  • Figure 3: The pipeline of the accelerated talking head model. At key timestep $t$, we perform full model inference and cache feature $f_{U_{31}}$. At non-key timesteps $t-1$ and $t-2$, we reuse cached $f_{U_{31}}$ and bypass the encoder ($D_{0}, D_{1}, D_{2}, D_{3}$), midblock $M$, and all of the decoder ($U_{0}, U_{1}, U_{2}, U_{3}$) except its last layer $U_{32}$. Moreover, denoising prediction at $t-1$ and $t-2$ can be executed in parallel and further accelerated through decoupled foreground attention.
  • Figure 4: $L_2$ distance between consecutive timesteps for: (a) input latents, (b) predicted noise. $t_{thresh}$ is the threshold timestep after which input latents estimation is applied.
  • Figure 5: Qualitative results on the HDTF and MEAD datasets using the Hallo model. From top to bottom: results from the base model, DeepCache, FasterDiffusion, our proposed method, and Ground Truth (GT).
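
The two redundancy measures named in the Figure 1 caption (the $L_2$ distance between consecutive timesteps and the cosine similarity matrix across all timesteps) can be computed as sketched below. This is an illustrative sketch only: the helper names are hypothetical, and each feature is assumed to be flattened into a plain list of floats, one list per timestep.

```python
import math
from typing import List

Vector = List[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two flattened feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consecutive_l2(features: List[Vector]) -> List[float]:
    """L2 distance between each pair of adjacent timesteps (Figure 1a style)."""
    return [l2_distance(features[i], features[i + 1])
            for i in range(len(features) - 1)]


def similarity_matrix(features: List[Vector]) -> List[List[float]]:
    """Pairwise cosine similarity between all timesteps (Figure 1b style)."""
    return [[cosine_similarity(a, b) for b in features] for a in features]
```

Small consecutive distances and similarities close to 1 across many timesteps indicate that a feature changes little during sampling, which is the property that makes it a candidate for caching.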