3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation

Balamurugan Thambiraja, Malte Prinzler, Sadegh Aliakbarian, Darren Cosker, Justus Thies

TL;DR

3DiFACE introduces a diffusion-based framework for holistic 3D facial animation that jointly models lip and head motion from audio and enables editing via keyframes. It uses two fully convolutional diffusion models (facial and head) sharing a common audio encoder, with viseme-level training to exploit diversity and allow arbitrary-length synthesis. A subject-specific fine-tuning path enables speaking-style personalization, while a sparsely-guided diffusion (SGDiff) mechanism provides precise head-motion editing grounded in imputation signals. Across VOCAset, HDTF, and in-the-wild data, 3DiFACE achieves higher lip-sync fidelity and richer motion diversity, while supporting editing operations such as keyframing and interpolation, reducing production time for lifelike avatars. The work demonstrates strong quantitative gains in DivL/DivH and BA, with perceptual studies confirming superior naturalness and style preservation relative to state-of-the-art baselines, and discusses practical implications and ethical considerations for realistic avatar synthesis.

Abstract

Creating personalized 3D animations with precise control and realistic head motions remains challenging for current speech-driven 3D facial animation methods. Editing these animations is especially complex and time-consuming, requires precise control, and is typically handled by highly skilled animators. Most existing works focus on controlling the style or emotion of the synthesized animation and cannot edit or regenerate parts of an input animation. They also overlook the fact that multiple plausible lip and head movements can match the same audio input. To address these challenges, we present 3DiFACE, a novel method for holistic speech-driven 3D facial animation. Our approach produces diverse plausible lip and head motions for a single audio input and allows for editing via keyframing and interpolation. Specifically, we propose a fully-convolutional diffusion model that can leverage the viseme-level diversity in our training corpus. Additionally, we employ speaking-style personalization and a novel sparsely-guided motion diffusion to enable precise control and editing. Through quantitative and qualitative evaluations, we demonstrate that our method is capable of generating and editing diverse holistic 3D facial animations given a single audio input, with control over the trade-off between fidelity and diversity. Code and models are available here: https://balamuruganthambiraja.github.io/3DiFACE

Paper Structure

This paper contains 40 sections, 12 equations, 10 figures, 10 tables, 2 algorithms.

Figures (10)

  • Figure 1: Illustration of holistic 3D facial motion editing with and without 3DiFACE. Head motion editing is shown in (a) and (b), where standard diffusion ignores the imputation signal. Facial motion editing (c) shows the unrealistic style shifts produced by classical diffusion; see Frame 39 and Frame 53.
  • Figure 2: Overview of our method. We employ two diffusion-based motion generators with a shared audio encoder to model 3D facial and head motion separately.
  • Figure 3: Our facial motion generator takes noised vertex displacements, denoted as $x_t$, and the diffusion time step embedding as inputs to predict a denoised sample $\hat{x}_0$, leveraging both the audio feature signal $\hat{A}$ and a person-specific feature vector $S_i$. Note that $N$ corresponds to the frame count of the sequence and $D$ to the number of vertices.
  • Figure 4: Illustration of standard diffusion (left) and our sparsely-guided diffusion (right). In the forward diffusion process of the latter, part of the noisy input signal is replaced with the ground-truth signal, and a guidance flag of 0 or 1 is concatenated to the noisy and ground-truth regions, respectively.
  • Figure 5: Qualitative comparison: Our method outperforms the baselines in creating more accurate lip-synced facial animations with diverse head movements. Specifically, TalkSHOW produces animations with jittery artifacts, while SadTalker yields muted and generic animations.
  • ...and 5 more figures
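
The forward process described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the usual DDPM noising rule $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, a per-frame keyframe mask, and a flag appended as one extra channel per frame. The function name `sparsely_guided_forward`, the argument names, and the toy 6-frame sequence are all hypothetical.

```python
import math
import random


def sparsely_guided_forward(x0, is_keyframe, alpha_bar_t, rng):
    """Noise a motion sequence while keeping keyframes as ground truth.

    x0          -- list of N frames, each a list of D floats (clean motion)
    is_keyframe -- list of N bools; True marks a frame kept as ground truth
    alpha_bar_t -- cumulative noise-schedule term in (0, 1] (assumed DDPM form)
    rng         -- random.Random instance, so the noise is reproducible

    Returns N frames of length D + 1; the last entry is the guidance flag.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    out = []
    for frame, pinned in zip(x0, is_keyframe):
        if pinned:
            # Ground-truth region: copy the clean frame and append flag 1.
            out.append(list(frame) + [1.0])
        else:
            # Noisy region: standard forward diffusion, then append flag 0.
            noisy = [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0)
                     for v in frame]
            out.append(noisy + [0.0])
    return out


# Hypothetical 6-frame head-motion sequence with 3 values per frame.
clean = [[0.1 * n, 0.0, -0.05 * n] for n in range(6)]
# Keep the first and last frames as keyframes; everything between is noised.
mask = [n in (0, 5) for n in range(6)]
x_t = sparsely_guided_forward(clean, mask, alpha_bar_t=0.5, rng=random.Random(0))
```

The appended flag gives a denoising network an explicit way to tell clean frames from noisy ones. How 3DiFACE consumes the flag internally is not specified in this excerpt, so the sketch stops at constructing the input.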