Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang

TL;DR

KDTalker addresses the limitations of existing audio-driven talking portraits by unifying unsupervised implicit 3D keypoints with a spatiotemporal diffusion model guided by reference priors. The method uses a spatiotemporal attention mechanism to capture long-range audio-keypoint dependencies, enabling accurate lip synchronization and diverse, natural head poses with real-time efficiency. Extensive experiments on VoxCeleb and HDTF demonstrate state-of-the-art lip sync, pose diversity, and inference speed, supported by thorough ablations and user studies. This approach advances practical, high-fidelity talking portraits for VR, digital humans, and filmmaking by balancing identity preservation, detail, and computational efficiency.

Abstract

Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details because the 3D Morphable Model is limited to a fixed set of points. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using diffusion networks but incur identity distortion and high computational cost. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance in lip synchronization accuracy, head pose diversity, and execution efficiency. Our code is available at https://github.com/chaolongy/KDTalker.
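To make the idea of attending over both space and time concrete, the following is a minimal NumPy sketch of factorized spatial-then-temporal self-attention over a keypoint sequence. It is not the authors' network: the abstract does not specify the attention layout, so the factorization, the identity query/key/value projections, the residual connections, and the sizes (8 frames, 21 keypoints, 16 features) are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # x has shape (..., N, D). Identity Q/K/V projections are used for
    # brevity (an assumption; a real model would learn these).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, D)

def spatiotemporal_attention(seq):
    # seq has shape (T, K, D): frames, keypoints, feature dimension.
    # Spatial step: within each frame, keypoints attend to each other.
    spatial = seq + self_attention(seq)               # batched over T
    # Temporal step: each keypoint attends across all frames.
    per_keypoint = np.swapaxes(spatial, 0, 1)         # (K, T, D)
    temporal = per_keypoint + self_attention(per_keypoint)
    return np.swapaxes(temporal, 0, 1)                # back to (T, K, D)

# Hypothetical sizes: 8 frames, 21 keypoints, 16-dim features.
rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 21, 16))
out = spatiotemporal_attention(seq)
print(out.shape)  # (8, 21, 16)
```

Factorizing the attention this way costs on the order of T·K² + K·T² score entries instead of (T·K)² for joint attention over all frame-keypoint pairs, which is one plausible route to the efficiency the abstract describes.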

Paper Structure

This paper contains 23 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The proposed KDTalker: A keypoint-based spatiotemporal diffusion framework that generates synchronized, high-fidelity talking videos from audio and a single image, enhancing pose diversity and expression detail with realistic, temporally consistent animations.
  • Figure 2: Inference Time vs. Head Diversity and LSE-D. The size of each circle represents the LSE-D (Lip Sync Error Distance), a metric quantifying the alignment between lip movements and audio. A smaller circle indicates a lower LSE-D value, reflecting better lip sync performance.
  • Figure 3: Overview of the proposed KDTalker for talking portrait synthesis.
  • Figure 4: Reference-Guided Priors.
  • Figure 5: Spatiotemporal-Aware Attention Network.
  • ...and 2 more figures