AnimateMe: 4D Facial Expressions via Diffusion Models
Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Stefanos Zafeiriou
TL;DR
AnimateMe introduces a mesh-space diffusion framework that applies the diffusion process directly to a fixed-topology facial mesh, using a Graph Neural Network as the denoiser, to enable controllable and high-fidelity 4D expressions. A novel consistent noise sampling strategy ensures temporal coherence and faster generation by reusing a shared noise sequence across frames. The model is trained frame by frame with a static DDPM objective on deformations from a neutral mesh, and it extends to textured 4D animation via a latent diffusion model conditioned on the neutral texture latent and expression signals. Evaluations on CoMA show superior handling of extreme expressions, and the textured extension, trained on MimicMe, demonstrates alignment between geometry and texture and scalability to large datasets. These results establish a data-driven, end-to-end approach to 4D facial expression synthesis with potential applications in avatar realism and animation pipelines, while emphasizing ethical considerations around consent and misuse.
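The static DDPM objective on deformations from a neutral mesh can be illustrated with a minimal sketch. It assumes a standard epsilon-prediction loss and a linear beta schedule, neither of which is confirmed by the summary above. All function names are hypothetical, vertices are stored as flat coordinate lists, and `denoiser` stands in for the paper's GNN as an arbitrary callable.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def deformation(expression_vertices, neutral_vertices):
    """Per-coordinate offset of an expression mesh from its neutral mesh."""
    return [e - n for e, n in zip(expression_vertices, neutral_vertices)]

def noise_deformation(d0, alpha_bar_t, eps):
    """Forward process: d_t = sqrt(alpha_bar_t) * d0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * d + b * e for d, e in zip(d0, eps)]

def ddpm_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

def training_step(expression_vertices, neutral_vertices, denoiser, alpha_bars, rng):
    """One single-frame training step: noise the deformation, score the prediction."""
    d0 = deformation(expression_vertices, neutral_vertices)
    t = rng.randrange(len(alpha_bars))            # random diffusion timestep
    eps = [rng.gauss(0.0, 1.0) for _ in d0]       # Gaussian noise per coordinate
    d_t = noise_deformation(d0, alpha_bars[t], eps)
    return ddpm_loss(eps, denoiser(d_t, t))       # denoiser: placeholder for the GNN
```

Because each step sees only one frame and its neutral mesh, training needs no temporal model. In this sketch, the fixed topology matters only in that the expression and neutral vertex lists must line up index by index.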
Abstract
The field of photorealistic 3D avatar reconstruction and generation has garnered significant attention in recent years; however, animating such avatars remains challenging. Recent advances in diffusion models have notably enhanced the capabilities of generative models in 2D animation. In this work, we apply these models directly in the 3D domain to achieve controllable and high-fidelity 4D facial animation. By integrating the strengths of diffusion processes and geometric deep learning, we employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space and enabling the generation of 3D facial expressions. This facilitates the generation of facial deformations through a mesh-diffusion-based model. Additionally, to ensure temporal coherence in our animations, we propose a consistent noise sampling method. Through a series of quantitative and qualitative experiments, we show that the proposed method outperforms prior work in 4D expression synthesis by generating high-fidelity extreme expressions. Furthermore, we apply our method to textured 4D facial expression generation through a straightforward extension that involves training on a large-scale textured 4D facial expression database.
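One way to read the consistent noise sampling idea is sketched below. It assumes standard DDPM ancestral sampling in which every frame starts from the same initial noise and reuses the same per-step noise, so that frames differ only through their conditioning signal. The exact procedure in the paper may differ. All names are hypothetical, and `denoiser` is any callable that predicts noise from a sample, a timestep, and a per-frame condition.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; any valid schedule works here)."""
    return [beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
            for t in range(num_steps)]

def make_shared_noise(num_steps, dim, seed=0):
    """One noise sequence: the start sample plus one noise vector per reverse step."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_steps + 1)]

def sample_frame(denoiser, condition, shared_noise, betas):
    """DDPM ancestral sampling for one frame, taking all randomness from shared_noise."""
    num_steps = len(betas)
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = list(shared_noise[0])                      # x_T, identical for every frame
    for t in reversed(range(num_steps)):
        eps_pred = denoiser(x, t, condition)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        mean = [scale * (xi - coef * ei) for xi, ei in zip(x, eps_pred)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0   # no noise on the last step
        z = shared_noise[num_steps - t]            # same z at this step for every frame
        x = [m + sigma * zi for m, zi in zip(mean, z)]
    return x

def sample_sequence(denoiser, conditions, num_steps, dim, seed=0):
    """Generate one sample per frame condition, all from a single noise sequence."""
    betas = linear_betas(num_steps)
    shared = make_shared_noise(num_steps, dim, seed)
    return [sample_frame(denoiser, c, shared, betas) for c in conditions]
```

With the noise held fixed, each output is a deterministic function of the frame's condition, so conditions that change smoothly over time yield frames that change smoothly as well. Sampling fresh noise per frame would instead add frame-to-frame jitter unrelated to the expression signal.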
