MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Seyeon Kim; Siyoon Jin; Jihye Park; Kihong Kim; Jiyoung Kim; Jisu Nam; Seungryong Kim

TL;DR

MoDiTalker tackles high-fidelity talking head generation by disentangling motion from video through a two-stage diffusion framework. It introduces Audio-to-Motion (AToM) to synthesize lip-synced facial motion from audio conditioned on an identity frame and Motion-to-Video (MToV) to render temporally coherent video from motion using tri-plane conditioning. The approach achieves state-of-the-art results on the HDTF benchmark, with extensive ablations and a user study confirming improvements in lip-sync accuracy, identity preservation, and video quality, while offering substantially faster sampling than prior diffusion methods. This motion-disentangled design enables better generalization, cross-identity robustness, and practical inference times for real-world applications.

Abstract

Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aim to address these limitations and improve fidelity. However, they still face challenges, including long sampling times and difficulty maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels at capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.
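
The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustration of how the stages connect, not the authors' implementation: the function names, array shapes, and the 68-point landmark count are assumptions, and both stages are stand-in placeholders (a zero residual and a tiled identity frame) where the real system would run diffusion samplers. The residual formulation in the first stage follows the Figure 3 caption, which states that AToM learns the residual between the initial landmark and the landmark sequence.

```python
import numpy as np

NUM_LANDMARKS = 68  # assumption: a common 2D landmark count, not stated in the source


def atom_stage(audio_features: np.ndarray, init_landmark: np.ndarray) -> np.ndarray:
    """Placeholder for AToM: audio + initial landmark -> landmark sequence.

    Returns an array of shape (T, NUM_LANDMARKS, 2). A real model would sample
    the residual with a diffusion model conditioned on audio; here the residual
    is all zeros so that only the data flow is shown.
    """
    num_frames = audio_features.shape[0]
    residual = np.zeros((num_frames,) + init_landmark.shape)
    # Landmark sequence = initial landmark + per-frame residual (broadcast over T).
    return init_landmark[None] + residual


def mtov_stage(landmarks: np.ndarray,
               identity_frames: np.ndarray,
               pose_frames: np.ndarray) -> np.ndarray:
    """Placeholder for MToV: landmarks + identity/pose frames -> video.

    Returns an array of shape (T, H, W, 3). A real model would denoise a video
    latent conditioned on all three inputs; here the first identity frame is
    simply repeated for every output frame.
    """
    num_frames = landmarks.shape[0]
    return np.repeat(identity_frames[:1], num_frames, axis=0)


def generate_talking_head(audio_features, init_landmark, identity_frames, pose_frames):
    """Chain the two stages: motion is produced first, then rendered to video."""
    landmarks = atom_stage(audio_features, init_landmark)
    video = mtov_stage(landmarks, identity_frames, pose_frames)
    return landmarks, video


# Toy inputs (all shapes are hypothetical).
audio = np.random.rand(5, 80)                    # 5 frames of audio features
init_lm = np.random.rand(NUM_LANDMARKS, 2)       # landmarks of the identity frame
identity = np.random.rand(1, 64, 64, 3)          # one identity frame
pose = np.random.rand(5, 64, 64, 3)              # one pose frame per output frame

landmarks, video = generate_talking_head(audio, init_lm, identity, pose)
```

Keeping motion as an explicit intermediate (the landmark array) is what allows each stage to be trained and swapped independently, which is the disentanglement the abstract refers to.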

Paper Structure (41 sections, 6 equations, 21 figures, 5 tables)

Introduction
Related Work
GAN-based talking head generation.
Diffusion-based talking head generation.
Video diffusion model.
Preliminary: Denoising Diffusion Model
Method
Overview
Audio-to-Motion (AToM) Diffusion Model
Architectural details.
Training.
Motion-to-Video (MToV) Diffusion Model
Architectural details.
Training encoders.
Training diffusion model.
...and 26 more sections

Figures (21)

Figure 1: We present the Motion-Disentangled diffusion model for high-fidelity Talking head generation, dubbed MoDiTalker. This framework generates high-quality talking head videos through novel two-stage, motion-disentangled diffusion models.
Figure 2: Overall network architecture of MoDiTalker. Our framework consists of two distinct diffusion models: Audio-to-Motion (AToM) and Motion-to-Video (MToV). AToM aims to generate lip-synchronized facial landmarks, given an identity frame $x_{\mathrm{id}}$ and audio input $A$, as conditions. MToV generates high-fidelity talking head videos $\hat{X}_0$ using synthesized facial landmarks $L$ from AToM, identity frames $X_\mathrm{I}$, and pose frames $X_\mathrm{P}$ as conditions.
Figure 3: Overview of the Audio-to-Motion (AToM) diffusion model: (a) AToM is a transformer-based diffusion model that learns the residual between the initial landmark $l_\mathrm{init}$ and the landmark sequence, using the audio embedding $F_A$ and the initial landmark embedding $F_L$ as conditions. In addition, (b) we design the AToM block to process lip-related (upper-half) and lip-unrelated (lower-half) landmarks separately, allowing the model to focus more on generating lip-related movements while preserving the facial shape of the speaker.
Figure 4: Qualitative comparison with previous works: We compare MoDiTalker with previous GAN-based methods, including Wav2Lip [prajwal2020lip], PC-AVS [zhou2021pose], MakeItTalk [zhou2020makelttalk], and Audio2Head [wang2021audio2head], as well as diffusion-based methods, including Diffused Heads [stypulkowski2024diffused] and DreamTalk [ma2023dreamtalk].
Figure 5: Qualitative comparison with previous works in the cross-identity setting: We compare MoDiTalker with previous diffusion-based methods, including Diffused Heads [stypulkowski2024diffused] and DreamTalk [ma2023dreamtalk].
...and 16 more figures
