FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling, Henglin Shi, Hedvig Kjellström

TL;DR

FIELDS targets accurate, emotion-rich 3D face reconstruction from a single image by introducing direct 3D expression supervision from spontaneous 4D BP4D scans and an auxiliary, intensity-aware emotion head. The framework pairs FLAME-based 3DMM parameters with a hybrid 2D/3D training objective that combines a 3D expression loss, valence-arousal (VA) and discrete emotion-class supervision, and robust 2D consistency terms. Regularization from a stable base model and a two-step training procedure prevent decoder masking and over-exaggeration, so the geometry stays faithful while preserving subtle affective cues. Across six datasets, FIELDS achieves strong VA regression and emotion classification while maintaining state-of-the-art geometric fidelity, supported by comprehensive ablations and qualitative analyses.

Abstract

Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
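
The dual-supervision idea above can be illustrated with a minimal sketch of a combined objective. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`expression_loss`, `cross_entropy`, `va_loss`, `hybrid_loss`), the choice of mean squared error and softmax cross-entropy, and the weights `w_3d` and `w_emo`. The intensity-aware part of the emotion loss is not modeled. The sketch only shows how a 2D consistency term, a 3D expression-parameter term, and an emotion term could be summed when each sample carries only some of the labels.

```python
import math
from typing import Optional, Sequence


def expression_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and scan-fitted expression parameters (assumed form)."""
    if len(pred) != len(target):
        raise ValueError("expression vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def va_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error on (valence, arousal) pairs (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / 2.0


def hybrid_loss(
    loss_2d: float,
    pred_expr: Sequence[float],
    expr_target: Optional[Sequence[float]] = None,
    emo_logits: Optional[Sequence[float]] = None,
    emo_label: Optional[int] = None,
    pred_va: Optional[Sequence[float]] = None,
    va_target: Optional[Sequence[float]] = None,
    w_3d: float = 1.0,   # hypothetical weight
    w_emo: float = 1.0,  # hypothetical weight
) -> float:
    """Sum the 2D term with whichever supervised terms have labels for this sample."""
    total = loss_2d
    # 3D term: applies only to samples with fitted expression targets (e.g. scan data).
    if expr_target is not None:
        total += w_3d * expression_loss(pred_expr, expr_target)
    # Emotion term: applies only to samples with emotion annotations.
    if emo_logits is not None and emo_label is not None:
        emo = cross_entropy(emo_logits, emo_label)
        if pred_va is not None and va_target is not None:
            emo += va_loss(pred_va, va_target)
        total += w_emo * emo
    return total
```

A sample with scan-fitted targets but no emotion label contributes only the 2D and 3D terms, and an emotion-labeled image without a scan contributes only the 2D and emotion terms. This is one simple way to mix datasets with different annotations in a single batch.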

Paper Structure

This paper contains 24 sections, 14 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Comparison between the baseline approach, which relies on external emotion consistency losses, and our FIELDS framework, which introduces direct 3D expression parameter supervision from scan data alongside an integrated emotion‐recognition branch.
  • Figure 2: A circumplex model of affect [valencearousal, feldman1998independence] with discrete emotions [universals] overlaid. Adapted from [vaemotion].
  • Figure 3: Illustration of the FIELDS pipeline. Given an input image $I$, encoders predict FLAME parameters (shape $\beta$, expression $\Psi$, pose $\Theta$) and an appearance token $T$; FLAME produces geometry maps that, together with $I_b$ and $T$, are fused by a synthesizer to generate $\hat{I}$. Besides the inherited 2D consistency losses (gray box), we add 3D supervision (purple box): (i) BP4D provides fitted FLAME targets for an expression parameter loss $\mathcal{L}_{\text{3D-GT}}$, and (ii) AffectNet supplies labels for an auxiliary emotion head trained with $\mathcal{L}_{\text{emo}}$ on $(\Psi,\beta)$. Initialization: the FLAME parameter encoders are pretrained, the token encoder and synthesizer are initialized from TEASER [teaser], and the emotion head is randomly initialized. During training we freeze only the pose encoder $E_\theta$ and update the remaining components jointly.
  • Figure 4: Visual examples of 3D face reconstruction. From top to bottom: emotion class label, valence value label, arousal value label, input image, EMOCA [emoca], SMIRK [smirk], TEASER [teaser], FIELDS.
  • Figure 5: Emotion classification, per-class accuracy (mean$\pm$std) -- grouped bars for each emotion.
  • ...and 10 more figures
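
The initialization and freezing scheme described in the Figure 3 caption can be summarized as a small configuration sketch. The component names, the `Component` dataclass, and the split of the FLAME parameter encoders into three separate encoders are hypothetical and chosen only for illustration. The caption itself states only that the FLAME parameter encoders are pretrained, that the token encoder and synthesizer start from TEASER, that the emotion head is randomly initialized, and that only the pose encoder is frozen.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str        # hypothetical identifier for a pipeline module
    init: str        # where its starting weights come from
    trainable: bool  # whether it is updated during training


# One entry per module named in the Figure 3 caption (names are assumptions).
COMPONENTS: List[Component] = [
    Component("shape_encoder", "pretrained", True),
    Component("expression_encoder", "pretrained", True),
    Component("pose_encoder", "pretrained", False),  # the only frozen module
    Component("token_encoder", "TEASER", True),
    Component("synthesizer", "TEASER", True),
    Component("emotion_head", "random", True),
]


def trainable_names(components: List[Component]) -> List[str]:
    """Return the names of modules that receive gradient updates."""
    return [c.name for c in components if c.trainable]


def frozen_names(components: List[Component]) -> List[str]:
    """Return the names of modules kept fixed during training."""
    return [c.name for c in components if not c.trainable]
```

In a training framework, a list like `trainable_names(COMPONENTS)` would typically decide which parameter groups are handed to the optimizer.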