IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Sejong Yang, Seoung Wug Oh, Yang Zhou, Seon Joo Kim

TL;DR

IF-MDM generates high-fidelity talking head video from a single image and audio by learning appearance-aware implicit motion in a two-stage framework. It disentangles appearance from motion and uses a diffusion transformer, conditioned on speech and on motion statistics, to synthesize motion sequences for rendering, reaching real-time generation of 512x512 video at up to 45 fps. Compared with diffusion-based baselines and explicit face models, IF-MDM offers strong speed, competitive visual quality, and controllable motion through the motion mean and standard deviation, without per-identity retraining. It mitigates the artifacts of warping-based methods and shows robust lip-sync and temporal consistency on HDTF and in-the-wild data, although a gap remains relative to the best 3DMM-based lip-sync. The approach also emphasizes responsible deployment and ethical use. The authors state that the code will be released publicly alongside supplementary materials.

Abstract

We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, such as 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly alongside supplementary materials. The video results can be found at https://bit.ly/ifmdm_supplementary.

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The main difference of Implicit Face Motion Diffusion Model (IF-MDM) compared to previous methods.
  • Figure 2: The inference pipeline of the Implicit Face Motion Diffusion Model (IF-MDM).
  • Figure 3: The training pipeline of our framework and the detailed architecture of the implicit motion generator.
  • Figure 4: The detailed architecture of the stage 1 models.
  • Figure 5: Qualitative results on the HDTF dataset compared with baselines. The output of Real3DPortrait exhibits good lip-sync quality, yet gives the impression of a floating head. The output of AniPortrait shows a heat-haze-like distortion in the face and background when played. The video can be found in the supplementary files or at https://bit.ly/ifmdm_supplementary#hdtf-title.
  • ...and 2 more figures