AI killed the video star. Audio-driven diffusion model for expressive talking head generation

Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva

TL;DR

Dimitra++ tackles audio-driven talking head generation by combining a 3DMM-based motion representation with a diffusion-based transformer (cMDT) conditioned on both a reference image and audio. It emphasizes disentangled modeling of lip motion, facial expression, and head pose using three separate diffusion models, complemented by a 3DMM-to-RGB video renderer. Across VoxCeleb2 and CelebV-HQ, Dimitra++ achieves state-of-the-art quantitative and qualitative results with strong user-preference gains, while highlighting limitations of current evaluation metrics. The work also provides a detailed evaluation protocol and dataset processing guidelines to enable fair comparisons and future research, with a clear path toward real-time or higher-resolution rendering and broader expression control.

Abstract

We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

Paper Structure

This paper contains 23 sections, 5 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Dimitra++ pipeline. Dimitra++ comprises three main parts: a Motion Modeling Module, a Conditional Motion Diffusion Transformer (cMDT), and a Video Renderer. In the training stage, the Motion Modeling Module extracts 3D meshes (3DMM) from a video. The cMDT uses them, together with features extracted from an audio sequence, to first noise and then denoise the 3DMM sequence. In the inference stage, conditioned on an audio sequence and an identity 3DMM, the cMDT generates a 3DMM sequence from Gaussian noise. Finally, the Video Renderer transforms the 3DMM sequence into an RGB video.
  • Figure 2: Conditional Motion Diffusion Transformer (cMDT). The cMDT takes an audio sequence and a 3DMM frame as conditions, encoded by two separate encoders, i.e., the ID Encoder and the Audio Encoder. A transformer diffusion decoder then denoises a noise sequence into facial motions based on these conditions.
  • Figure 3: Examples of generated samples from the VoxCeleb2 dataset.
  • Figure 4: Examples of generated samples from the CelebV-HQ (CVHQ) dataset.
  • Figure 5: Comparison with examples from closed-source methods.
  • ...and 4 more figures
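
The "noise then denoise" training step mentioned in the Figure 1 caption can be illustrated with a minimal sketch of the noising half. This assumes a standard DDPM-style forward process with a linear beta schedule, which this excerpt does not confirm as the authors' choice. The function names, the number of diffusion steps, the sequence length (50 frames), and the coefficient dimension (64) are all hypothetical placeholders, and random values stand in for real 3DMM coefficients.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (standard DDPM forward step)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Placeholder "3DMM sequence": 50 frames x 64 coefficients (hypothetical sizes).
x0 = rng.standard_normal((50, 64))

# Noise the whole sequence at an intermediate diffusion step.
xt, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a setup like this, a denoiser would typically be trained to recover `eps` (or `x0`) from `xt`, the step index, and the conditioning signals (audio features and an identity frame). At inference time, generation would start from pure Gaussian noise of the same shape.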