4D Facial Expression Diffusion Model
Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
TL;DR
This work presents a diffusion-model-based framework for 4D facial expression generation that first learns realistic landmark dynamics unconditionally and then supports multiple downstream conditioning tasks through plug-and-play guidance of the reverse process. A landmark-guided mesh deformation module retargets the learned landmark trajectories to a neutral 3D mesh by predicting per-vertex displacements conditioned on facial geometry, yielding high-fidelity, diverse mesh sequences. The approach supports label and text conditioning, partial-sequence filling, and geometry-adaptive generation, all demonstrated on the CoMA and BU-4DFE datasets with 68 FLAME landmarks, achieving state-of-the-art performance in both landmark quality and mesh realism. The method offers a versatile, data-efficient means to synthesize expressive 4D faces, with practical implications for virtual avatars, animation, and human-computer interaction. All conditioning is accomplished without retraining the unconditional model, so multiple control signals can be explored efficiently at generation time.
Abstract
Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The task has traditionally relied heavily on digital craftspersons and remains largely unexplored. In this paper, we introduce a generative framework for producing 3D facial expression sequences (i.e., 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) learning a generative model trained over a set of 3D landmark sequences, and (2) generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. Although it can be trained unconditionally, its reverse process can still be conditioned on various signals. This allows us to efficiently develop several downstream conditional generation tasks, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometric deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions from a relatively small dataset alone, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM.
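
To make the idea of an unconditionally trained DDPM with a conditioned reverse process concrete, the following is a minimal NumPy sketch of the standard DDPM forward noising step and a single reverse step whose mean is shifted by an external guidance gradient. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear noise schedule, the sequence length of 40 frames, and the generic gradient-shift form of guidance are all assumptions introduced here, and the noise prediction `eps_pred` stands in for the output of a trained denoising network.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, common in DDPM setups)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def guided_reverse_step(x_t, t, eps_pred, betas, alpha_bar, rng,
                        grad=None, scale=1.0):
    """One reverse step x_t -> x_{t-1} with optional gradient guidance.

    eps_pred: noise predicted by the (unconditional) denoiser at step t.
    grad:     hypothetical gradient of a condition log-likelihood with
              respect to x_t (e.g. from a label classifier); None means
              purely unconditional sampling.
    """
    alpha_t = 1.0 - betas[t]
    # Posterior mean of the standard DDPM parameterisation.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    var = betas[t]  # fixed variance sigma_t^2 = beta_t
    if grad is not None:
        # Guidance only shifts the mean; the denoiser itself is untouched.
        mean = mean + scale * var * grad
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)


# Toy usage: a random "landmark sequence" of 40 frames x 68 landmarks x 3 coords.
rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.standard_normal((40, 68, 3))
x_t, eps = q_sample(x0, 500, alpha_bar, rng)
x_prev = guided_reverse_step(x_t, 500, eps, betas, alpha_bar, rng)
```

Because the condition enters only through `grad`, the same unconditional denoiser could in principle be reused with different guidance signals, which is the property the abstract relies on for its downstream tasks.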
