SMIRK: 3D Facial Expressions through Analysis-by-Neural-Synthesis

George Retsinas, Panagiotis P. Filntisis, Radek Danecek, Victoria F. Abrevaya, Anastasios Roussos, Timo Bolkart, Petros Maragos

TL;DR

SMIRK tackles the challenge of reconstructing expressive 3D faces from single images by replacing traditional differentiable rendering with a neural image-to-image translator, which provides geometry-focused supervision and reduces domain gaps. It introduces an expression-augmented training cycle that enforces consistency between predicted FLAME parameters and augmented expressions, enabling robust recovery of extreme, asymmetric, and subtle expressions. The framework uses FLAME for geometry, a U-Net neural renderer for appearance-free supervision, and a cycle-based augmentation strategy to enrich expression diversity. Experimental results demonstrate improved expressive reconstruction, competitive emotion-recognition metrics, and strong subjective assessments, highlighting SMIRK's practical impact for expressive 3D facial modeling in the wild.

Abstract

While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with a neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative, and particularly our perceptual evaluations demonstrate that SMIRK achieves new state-of-the-art performance on accurate expression reconstruction. Project webpage: https://georgeretsi.github.io/smirk/.
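
The sparse pixel sampling mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `sparse_pixel_hint`, the `face_mask` input, and the `keep_ratio` value are assumptions made for illustration, and the source does not say how the face region is obtained or how many pixels are kept.

```python
import numpy as np

def sparse_pixel_hint(image, face_mask, keep_ratio=0.01, rng=None):
    """Hide the face region of `image`, then reveal a sparse random subset of it.

    image:      (H, W, 3) float array.
    face_mask:  (H, W) bool array, True on the face (hypothetical input).
    keep_ratio: probability that each face pixel is revealed (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Each face pixel is revealed independently with probability keep_ratio.
    revealed = face_mask & (rng.random(face_mask.shape) < keep_ratio)
    # Visible pixels: everything outside the face, plus the sparse samples.
    visible = ~face_mask | revealed
    return image * visible[..., None], revealed

# Usage on a synthetic image with a square "face" region.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
face_mask = np.zeros((64, 64), dtype=bool)
face_mask[16:48, 16:48] = True
hint, revealed = sparse_pixel_hint(image, face_mask, keep_ratio=0.05, rng=rng)
```

In a pipeline of the kind the abstract describes, an image like `hint` would be passed to the neural renderer together with the rendered mesh geometry, so that color comes from the sampled pixels and the reconstruction loss mainly constrains geometry.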

Paper Structure (31 sections, 2 equations, 17 figures, 11 tables)

Figures (17)

  • Figure 1: Reconstruction pass. An input image is passed to the encoder which regresses FLAME and camera parameters. A 3D shape is reconstructed, rendered with a differentiable rasterizer and finally translated into the output domain with the image translation network. Then, standard self-supervised landmark, photometric and perceptual losses are computed.
  • Figure 2: Masking process. An input image is masked to obscure the face (upper path), and random pixels are then sampled to be unmasked (lower path).
  • Figure 3: Augmented cycle pass. The FLAME expression parameters of an existing reconstruction are modified. The resulting modified face is then rendered using our neural renderer. The rendering is then passed to the face reconstruction encoder to regress the FLAME parameters and a consistency loss between the modified input and reconstructed FLAME parameters is computed.
  • Figure 4: Neural expression augmentation. Our neural renderer enables us to modify the expression, generating a new image-3D training pair. We can edit the expression with random noise, permutation from other reconstructions, template injection, or zeroing.
  • Figure 5: Visual comparison of 3D face reconstruction. From left to right: Input, Deep3DFaceRecon [deng2019accurate], FOCUS [li2021fit], DECA [feng2021learning], EMOCAv2 [danecek2022emoca], and SMIRK. Many more examples can be found in the Suppl. Mat. and in the demo video on our webpage.
  • ...and 12 more figures
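
The expression edits named in the Figure 4 caption and the consistency loss from the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the noise scale, the 50-coefficient expression size, and the use of mean squared error for the consistency term are all hypothetical choices.

```python
import numpy as np

def augment_expressions(expr_batch, mode, rng, template=None, noise_std=0.5):
    """Return an edited copy of a batch of expression vectors, shape (B, D).

    The four modes mirror the edits named in the Figure 4 caption;
    their exact form here is assumed.
    """
    if mode == "noise":
        # Perturb every coefficient with Gaussian noise (scale is illustrative).
        return expr_batch + rng.normal(0.0, noise_std, size=expr_batch.shape)
    if mode == "permutation":
        # Swap expressions between the reconstructions in the batch.
        return expr_batch[rng.permutation(len(expr_batch))]
    if mode == "template":
        # Overwrite every expression with one fixed template vector of shape (D,).
        return np.broadcast_to(template, expr_batch.shape).copy()
    if mode == "zero":
        # Neutral expression: all coefficients set to zero.
        return np.zeros_like(expr_batch)
    raise ValueError(f"unknown mode: {mode}")

def cycle_consistency_loss(target_params, predicted_params):
    """Mean squared error between edited and re-estimated parameters (assumed form)."""
    return float(np.mean((target_params - predicted_params) ** 2))

# Usage: 4 reconstructions with 50 expression coefficients each (assumed size).
rng = np.random.default_rng(0)
expr = rng.normal(size=(4, 50))
edited = augment_expressions(expr, "permutation", rng)
# In training, `edited` would be rendered by the neural renderer and re-encoded;
# a perfect encoder is simulated here by reusing `edited` as the prediction.
loss = cycle_consistency_loss(edited, edited)
```

Because the edited parameters are known exactly, the rendered image and those parameters form a new image-3D training pair, which is what allows the cycle pass to supervise the encoder directly in parameter space.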