TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Soumya Mazumdar, Vineet Kumar Rakesh

TL;DR

TempoSyncDiff is a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. Its results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference.

Abstract

Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
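
As a hedged illustration of the teacher-student formulation described in the abstract, one common way to write a noise-prediction objective and a matching distillation loss is given here. The notation is an assumption, not taken from the paper's own equations: $s$ is the diffusion timestep, $\bar{\alpha}_s$ the cumulative noise schedule, $\mathbf{z}$ a clean VAE latent, $\kappa$ the conditioning, $\epsilon_{\theta}$ the teacher, and $\epsilon_{\phi}$ the student. The paper's exact losses and weightings may differ.

$$\mathbf{z}^{(s)} = \sqrt{\bar{\alpha}_s}\,\mathbf{z} + \sqrt{1-\bar{\alpha}_s}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

$$\mathcal{L}_{\mathrm{teacher}} = \mathbb{E}_{\mathbf{z},\,\boldsymbol{\epsilon},\,s}\left[\left\lVert \boldsymbol{\epsilon} - \epsilon_{\theta}\big(\mathbf{z}^{(s)}, s, \kappa\big) \right\rVert_2^2\right]$$

$$\mathcal{L}_{\mathrm{dist}} = \mathbb{E}_{\mathbf{z},\,\boldsymbol{\epsilon},\,s}\left[\left\lVert \epsilon_{\phi}\big(\mathbf{z}^{(s)}, s, \kappa\big) - \epsilon_{\theta}\big(\mathbf{z}^{(s)}, s, \kappa\big) \right\rVert_2^2\right]$$

In this form the teacher parameters $\theta$ are held fixed while $\mathcal{L}_{\mathrm{dist}}$ is minimized over $\phi$ only, so the student learns to reproduce the teacher's noise estimate on the same noisy latent and conditioning.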

Paper Structure (16 sections, 12 equations, 6 figures, 3 tables, 3 algorithms)

Figures (6)

  • Figure 1: Overview of TempoSyncDiff. A reference identity image and speech audio are encoded into identity features and viseme tokens forming conditioning $\kappa_t$. A latent diffusion UNet denoises a noisy latent $\mathbf{z}^{\mathrm{noisy}}_t$ under this conditioning. During training, a teacher guides a distilled student enabling few-step sampling. The denoised latent $\mathbf{z}_t$ is decoded by a VAE to generate the talking-head frame $\hat{\mathbf{I}}_t$.
  • Figure 2: Teacher–student distillation used in TempoSyncDiff. A frozen teacher denoiser $\epsilon_{\theta}$ predicts the noise for a noisy latent $z_t$ conditioned on $\kappa_t=\{I_{\mathrm{ref}},v_t\}$. The student denoiser $\epsilon_{\phi}$ receives the same inputs and is trained to match the teacher via the distillation loss $\mathcal{L}_{\mathrm{dist}}$. Optional identity and temporal regularizers can be applied during training.
  • Figure 3: Edge deployment-oriented inference. The student performs $K$ denoising steps in latent space. In E1 (full mode), the VAE decoder runs on-device and outputs frames. In E2 (hybrid mode), latents are returned for deferred decoding or external rendering.
  • Figure 4: Quality–latency trade-off across step counts (template). The latency axis includes Raspberry Pi 5 edge inference latency (E2 hybrid mode). The intended behavior is near-teacher quality at substantially lower latency for small $K$.
  • Figure :
  • ...and 1 more figure
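
The edge inference flow in the Figure 3 caption ($K$ student denoising steps in latent space, followed by on-device decoding in E1 or latent return in E2) can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' implementation. The deterministic DDIM-style update, the sine-squared schedule, and the names `toy_student`, `toy_decode`, and `run_edge_inference` are all hypothetical stand-ins. The "student" here is an oracle that treats the conditioning vector as the clean latent, which keeps the example self-checking.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Latent = List[float]
# Hypothetical signature: (noisy latent, alpha_bar at this step, conditioning) -> predicted noise.
Denoiser = Callable[[Latent, float, Sequence[float]], Latent]


def alpha_bar_schedule(num_steps: int) -> List[float]:
    """Assumed sine-squared signal levels: K + 1 values rising from noisy to 1.0 (clean)."""
    return [math.sin(0.5 * math.pi * (i + 1) / (num_steps + 1)) ** 2
            for i in range(num_steps + 1)]


def few_step_sample(student: Denoiser, cond: Sequence[float],
                    num_steps: int, seed: int = 0) -> Latent:
    """Run K deterministic DDIM-style updates (eta = 0) in latent space."""
    rng = random.Random(seed)
    a_bar = alpha_bar_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in cond]  # start from Gaussian noise
    for i in range(num_steps):
        eps = student(z, a_bar[i], cond)
        # Estimate the clean latent implied by the predicted noise.
        z0 = [(zi - math.sqrt(1.0 - a_bar[i]) * ei) / math.sqrt(a_bar[i])
              for zi, ei in zip(z, eps)]
        # Re-noise that estimate to the next, less noisy level.
        z = [math.sqrt(a_bar[i + 1]) * z0i + math.sqrt(1.0 - a_bar[i + 1]) * ei
             for z0i, ei in zip(z0, eps)]
    return z


def toy_student(z: Latent, a_bar: float, cond: Sequence[float]) -> Latent:
    """Hypothetical stand-in for the student: an oracle whose clean target is `cond`."""
    return [(zi - math.sqrt(a_bar) * ci) / math.sqrt(1.0 - a_bar)
            for zi, ci in zip(z, cond)]


def toy_decode(z: Latent) -> List[float]:
    """Hypothetical stand-in for the VAE decoder: map [-1, 1] latents to [0, 1] pixels."""
    return [min(1.0, max(0.0, 0.5 * (v + 1.0))) for v in z]


def run_edge_inference(mode: str, student: Denoiser, cond: Sequence[float],
                       num_steps: int, seed: int = 0) -> Dict[str, List[float]]:
    """E1 decodes on-device and returns frames; E2 returns latents for deferred decoding."""
    z = few_step_sample(student, cond, num_steps, seed)
    if mode == "E1":
        return {"frames": toy_decode(z)}
    if mode == "E2":
        return {"latents": z}
    raise ValueError(f"unknown mode: {mode!r}")
```

With the oracle student, calling `run_edge_inference("E2", toy_student, cond, num_steps=4)` recovers `cond` up to floating-point error, because the final step uses a signal level of 1.0 and leaves no residual noise. A real student would replace `toy_student` with a learned network, and its output quality would depend on $K$.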