4D Facial Expression Diffusion Model

Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo

TL;DR

This work presents a diffusion-model-based framework for 4D facial expression generation that first learns realistic landmark dynamics unconditionally and then supports multiple downstream conditioning tasks through plug-and-play reverse-process guidance. A landmark-guided mesh deformation module retargets the learned landmark trajectories to a neutral 3D mesh by predicting per-vertex displacements conditioned on facial geometry, yielding high-fidelity, diverse mesh sequences. The approach supports label and text conditioning, partial-sequence filling, and geometry-adaptive generation, all demonstrated on CoMA and BU-4DFE datasets with 68 FLAME landmarks, achieving state-of-the-art performance in both landmark quality and mesh realism. The method offers a versatile, data-efficient means to synthesize expressive 4D faces, with practical implications for virtual avatars, animation, and human-computer interaction. All conditioning is accomplished without retraining the unconditional model, enabling efficient exploration of multiple control signals during generation.

Abstract

Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. This task, which has traditionally relied heavily on digital craftspersons, remains underexplored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) learning a generative model that is trained over a set of 3D landmark sequences, and (2) generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. Although it can be trained unconditionally, its reverse process can still be conditioned on various signals. This allows us to efficiently develop several downstream tasks involving conditional generation, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometrical deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions from a relatively small dataset, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM.
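To make the DDPM terminology in the abstract concrete, the following is a minimal sketch of the standard forward (noising) step $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t) I)$ applied to a toy landmark sequence. It is not taken from the authors' code. The linear schedule, its endpoints, the 1000 steps, and the names `linear_beta_schedule`, `cumulative_alpha_bar`, and `q_sample` are assumptions chosen for illustration; only the 68-landmark layout comes from the summary above.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) for a 1-based step t.

    x0 is a flat list of landmark coordinates. Returns (x_t, eps), where eps
    is the Gaussian noise that a noise approximator would be trained to predict.
    """
    abar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

# Toy stand-in for a landmark sequence: F frames x 68 landmarks x 3 coordinates,
# flattened into one list. Real data would come from a dataset, not random values.
F, N_LANDMARKS = 4, 68
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(F * N_LANDMARKS * 3)]

alpha_bars = cumulative_alpha_bar(linear_beta_schedule(1000))
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because $x_t$ is a fixed linear combination of $x_0$ and the noise, a network that predicts the noise from $(x_t, t)$ implicitly provides an estimate of $x_0$, which is what the reverse process relies on.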

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, 5 tables, 4 algorithms.

Figures (8)

  • Figure 1: Overview of the proposed approach. Generally, the diffusion process is used to train the noise approximator while the reverse process is used to sample $x_0$ from the distribution $q$. However, some of the tasks developed in the section on downstream tasks require both processes for sampling. The bidirectional transformer takes as input the sum of the outputs of three embedding layers: the temporal embedding layer (TE) that takes as input $t$, the positional encoding layer (PE) that takes as input an integer sequence from $1$ to $F$, and the feature embedding layer (FE) that takes $x_t$. The landmark-guided encoder-decoder retargets the expression of $L_f$ onto the input mesh $M$ to estimate $M_f$ at each frame.
  • Figure 2: Animated mesh sequences guided by the label "mouth side" (top), "mouth extreme" (middle), and "cheeks in" (bottom). The meshes are obtained by retargeting the expression of the generated $x_0$ on different neutral faces.
  • Figure 3: Text-driven generation results obtained by the enriched text task ("from neutral face to bareteeth" (top)), and by the raw text task ("angry mouth down" (middle), "disgust high smile" (bottom)). The input texts used for the raw text task are the combinations of two terms used for training. For instance, "disgust high smile" is a new description that hasn't been seen before, which combines "disgust" and "high smile".
  • Figure 4: Diversity of expressions generated with the label "eyebrow" (left), and "high smile" (right) in the geometry-adaptive generation task. All illustrated sequences are of type N2E. Note that eyebrows can be either lowered (the second and third rows) or raised (the first row). Although the poses of maximal expression intensity look all similar in the three sequences of "high smile", their temporal properties are significantly different.
  • Figure 5: Qualitative comparison of our method (b) with S2D (c), CoMA (d), and Linear (e) in the landmark-guided deformation of a given mesh. The ground truth meshes are given in the first column (a). The expression of the first row is close to the neutral face and that of the second row is taken from a sequence labeled as "mouth extreme".
  • ...and 3 more figures
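
The input construction described in the Figure 1 caption (one token per frame, formed as the sum of TE, PE, and FE outputs) can be sketched as follows. This is an illustrative sketch only: the caption does not say how each embedding layer is implemented, so the sinusoidal encodings for TE and PE, the single linear layer for FE, the embedding width of 16, and the helper names `sinusoidal`, `linear`, and `embed_sequence` are all assumptions.

```python
import math
import random

def sinusoidal(value, dim):
    """Sinusoidal encoding of a scalar into a vector of length `dim` (dim even)."""
    half = dim // 2
    out = []
    for i in range(half):
        freq = math.exp(-math.log(10000.0) * i / half)
        out.append(math.sin(value * freq))
        out.append(math.cos(value * freq))
    return out

def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def embed_sequence(x_t, t, weight, bias, dim):
    """Return one token per frame: FE(x_t[f]) + PE(f) + TE(t)."""
    te = sinusoidal(t, dim)  # diffusion-step embedding, shared by all frames
    tokens = []
    for f, frame in enumerate(x_t, start=1):  # frames indexed 1..F as in the caption
        fe = linear(frame, weight, bias)      # feature embedding of the noisy landmarks
        pe = sinusoidal(f, dim)               # positional encoding of the frame index
        tokens.append([a + b + c for a, b, c in zip(fe, pe, te)])
    return tokens

# Toy inputs: F frames, each a flattened set of 68 3D landmarks (random stand-ins).
rng = random.Random(0)
F, IN_DIM, DIM = 5, 68 * 3, 16
x_t = [[rng.gauss(0.0, 1.0) for _ in range(IN_DIM)] for _ in range(F)]
W = [[rng.gauss(0.0, 0.02) for _ in range(IN_DIM)] for _ in range(DIM)]
b = [0.0] * DIM

tokens = embed_sequence(x_t, t=10, weight=W, bias=b, dim=DIM)  # F tokens of width DIM
```

Summing the three embeddings keeps the token width fixed, so the transformer sees a length-$F$ sequence in which every token carries the frame content, the frame position, and the current diffusion step.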