AnimateMe: 4D Facial Expressions via Diffusion Models

Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Stefanos Zafeiriou

TL;DR

AnimateMe introduces a mesh-space diffusion framework that runs the diffusion process directly on a fixed-topology facial mesh, using a Graph Neural Network as the denoiser, to enable controllable, high-fidelity 4D expressions. A consistent noise sampling strategy reuses a shared noise sequence across frames, which keeps the animation temporally coherent and speeds up generation. The model is trained frame by frame with a static DDPM objective on deformations from a neutral mesh, and it extends to textured 4D animation via a latent diffusion model conditioned on the neutral texture latent and expression signals. Evaluations on CoMA show better handling of extreme expressions than prior work, and the textured extension on MimicMe demonstrates alignment between geometry and texture, highlighting scalability to large datasets. These results establish a data-driven, end-to-end approach to 4D facial expression synthesis with potential applications in avatar realism and animation pipelines, while emphasizing ethical considerations around consent and misuse.

Abstract

The field of photorealistic 3D avatar reconstruction and generation has garnered significant attention in recent years; however, animating such avatars remains challenging. Recent advances in diffusion models have notably enhanced the capabilities of generative models in 2D animation. In this work, we directly utilize these models within the 3D domain to achieve controllable and high-fidelity 4D facial animation. By integrating the strengths of diffusion processes and geometric deep learning, we employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space and enabling the generation of 3D facial expressions. This facilitates the generation of facial deformations through a mesh-diffusion-based model. Additionally, to ensure temporal coherence in our animations, we propose a consistent noise sampling method. Under a series of both quantitative and qualitative experiments, we showcase that the proposed method outperforms prior work in 4D expression synthesis by generating high-fidelity extreme expressions. Furthermore, we applied our method to textured 4D facial expression generation, implementing a straightforward extension that involves training on a large-scale textured 4D facial expression database.
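To make "formulating the diffusion process directly on the mesh space" concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to per-vertex deformations from a neutral mesh, together with the usual noise-prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, its endpoint values, the toy 3-vertex mesh, and all function names are hypothetical choices made for this example.

```python
import math
import random

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) under a linear beta schedule (assumed values)."""
    alpha_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def deformation(neutral, expressive):
    """Flattened per-vertex offsets of an expressive mesh from the neutral mesh."""
    return [e - n for vn, ve in zip(neutral, expressive) for n, e in zip(vn, ve)]

def noise_deformation(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    x_t = [math.sqrt(a) * d + math.sqrt(1.0 - a) * e for d, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (the simple DDPM loss)."""
    return sum((p - q) ** 2 for p, q in zip(eps_pred, eps)) / len(eps)

# Toy data: a 3-vertex "mesh" with hypothetical coordinates; both meshes share one topology.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
expressive = [(0.0, 0.1, 0.0), (1.0, -0.2, 0.0), (0.0, 1.0, 0.3)]

rng = random.Random(0)
alpha_bar = linear_alpha_bar(T=1000)
x0 = deformation(neutral, expressive)   # training target lives in deformation space
t = rng.randrange(1000)                 # one random timestep per training sample
x_t, eps = noise_deformation(x0, t, alpha_bar, rng)
```

In training, a denoiser would receive `x_t`, the timestep `t`, and the expression condition, and its output would be scored against `eps` with `noise_prediction_loss`. Because each sample is a single frame, no temporal model is needed at this stage.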

Paper Structure

This paper contains 26 sections, 2 equations, 9 figures, 1 table, and 1 algorithm.

Figures (9)

  • Figure 1: Overview of the proposed frame generation method: Our method generates frames by integrating a point cloud DDPM with an SCN denoising model, conditioned on the concatenation of the expression and timestep conditions. It employs spiral convolutional layers, modulating output features with a simple gating and bias mechanism tailored to the conditions. Throughout this process, noise is predicted and systematically subtracted at each timestep until the frame is completely denoised and thus generated. Although the method operates on deformations, for visualization we apply them to the neutral mesh at every timestep to show the temporal evolution of the diffusion process.
  • Figure 2: Animation generation via consistent noise sampling: The process initiates by sampling the initial noise $\bf{\epsilon}$ and the denoising noise sequence $\bf{z}$ over $T-1$ timesteps. The diffusion process begins with the first frame using the full range of timesteps. Upon reaching a late denoised stage at $t_s$, the generation for subsequent frames starts in parallel from this denoised state, utilizing only the remaining $t_s$ timesteps. All frames share the same denoising sequence $\bf{z}$, with differences arising from the expression intensity.
  • Figure 3: Average per-frame specificity error (mm) of the proposed method and the baseline methods LSTM [potamias2020learning] and MO3DGAN [otberdout2022sparse].
  • Figure 4: Comparison of final-frame generations with their respective ground truths, for our method and MO3DGAN [otberdout2022sparse]. LSTM results are omitted for brevity and because of their significantly worse performance.
  • Figure 5: Qualitative comparison of extreme expression generations, for our method and MO3DGAN [otberdout2022sparse]. Final expressions are illustrated in the main paper. For full-length dynamic 4D expressions, please refer to the supplementary material.
  • ...and 4 more figures
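
The consistent noise sampling procedure described in the Figure 2 caption can be sketched as follows. This is a rough illustration of the control flow only, under several assumptions: `toy_denoiser` is a hypothetical stand-in for the trained network (it simply assumes the clean deformation is the expression intensity times a fixed `peak` deformation), the linear beta schedule and the values of `T` and `t_s` are arbitrary, and the frames are processed in a loop rather than in parallel. The reverse step itself is the standard DDPM update.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its cumulative alpha products (assumed values)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bar.append(prod)
    return betas, alpha_bar

def toy_denoiser(x, t, intensity, peak, alpha_bar):
    """Hypothetical stand-in for the trained denoiser: predicts the noise in x,
    assuming the clean deformation is intensity * peak."""
    a = alpha_bar[t]
    return [(xi - math.sqrt(a) * intensity * pi) / math.sqrt(1.0 - a)
            for xi, pi in zip(x, peak)]

def reverse_step(x, t, intensity, peak, betas, alpha_bar, z):
    """One standard DDPM reverse step; z[t - 1] is the shared noise for step t."""
    eps_hat = toy_denoiser(x, t, intensity, peak, alpha_bar)
    coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
    scale = 1.0 / math.sqrt(1.0 - betas[t])
    sigma = math.sqrt(betas[t])
    noise = z[t - 1] if t > 0 else [0.0] * len(x)  # no noise on the final step
    return [scale * (xi - coef * ei) + sigma * ni
            for xi, ei, ni in zip(x, eps_hat, noise)]

def generate_animation(intensities, peak, T=50, t_s=10, seed=0):
    """Generate one deformation per expression intensity with shared noise."""
    rng = random.Random(seed)
    n = len(peak)
    betas, alpha_bar = make_schedule(T)
    # Sample the initial noise and the T-1 denoising noises once for the whole animation.
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(T - 1)]
    # Shared prefix: denoise the first frame until t_s steps remain.
    x = eps
    for t in range(T - 1, t_s - 1, -1):
        x = reverse_step(x, t, intensities[0], peak, betas, alpha_bar, z)
    # Every frame resumes from that state and runs only the remaining t_s steps.
    frames = []
    for c in intensities:
        x_f = list(x)
        for t in range(t_s - 1, -1, -1):
            x_f = reverse_step(x_f, t, c, peak, betas, alpha_bar, z)
        frames.append(x_f)
    return frames

peak = [0.0, 0.5, -0.2, 0.3]  # hypothetical flattened peak deformation
frames = generate_animation([0.0, 0.25, 0.5, 1.0], peak)
```

In this sketch the denoiser is called (T - t_s) + F * t_s times for F frames, instead of F * T times when every frame is generated independently; with the illustrative values T = 50, t_s = 10, and F = 4 that is 80 calls rather than 200. Since all frames share the same starting state and the same noise sequence, they differ only through the expression intensity passed to the denoiser.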