FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision
Chen Ling, Henglin Shi, Hedvig Kjellström
TL;DR
FIELDS targets accurate, emotion-rich 3D face reconstruction from a single image by introducing direct 3D expression supervision from spontaneous 4D scans in BP4D and an auxiliary, intensity-aware emotion head. The framework predicts FLAME-based 3DMM parameters and trains them with a hybrid 2D/3D objective that combines a 3D expression loss, valence-arousal (VA) and discrete emotion-class supervision, and robust 2D consistency terms. Regularization toward a stable base model and a two-step training procedure prevent decoder masking and over-exaggeration, so the geometry stays faithful while subtle affective cues are preserved. Across six datasets, FIELDS achieves strong VA regression and emotion classification while maintaining state-of-the-art geometric fidelity, supported by comprehensive ablations and qualitative analyses.
Abstract
Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
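To make the dual-supervision idea concrete, the sketch below shows one way a hybrid objective of this kind could be assembled: a 3D expression-parameter term, a VA regression term, an emotion-classification term, and a 2D landmark consistency term, combined as a weighted sum. This is an illustrative assumption, not the loss used in FIELDS: the specific loss forms (mean squared error, cross-entropy, mean landmark distance), the weights, the 50-dimensional expression vector, and the names `hybrid_loss`, `expr`, `va`, `logits`, and `lmk` are all hypothetical.

```python
import numpy as np

# Hypothetical loss weights; the paper's actual weighting is not given here.
DEFAULT_WEIGHTS = {"expr": 1.0, "va": 1.0, "cls": 1.0, "lmk": 1.0}


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-softmax."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def hybrid_loss(pred, target, weights=None):
    """Weighted sum of 3D expression, emotion, and 2D consistency terms.

    pred:   dict with 'expr' (expression params), 'va' (valence, arousal),
            'logits' (emotion-class scores), 'lmk' (N x 2 projected landmarks)
    target: dict with 'expr', 'va', 'label' (class index), 'lmk'
    """
    w = DEFAULT_WEIGHTS if weights is None else weights
    terms = {
        # Direct 3D supervision: match scan-derived expression parameters.
        "expr": float(np.mean((pred["expr"] - target["expr"]) ** 2)),
        # Continuous emotion supervision on valence and arousal.
        "va": float(np.mean((pred["va"] - target["va"]) ** 2)),
        # Discrete emotion supervision.
        "cls": cross_entropy(pred["logits"], target["label"]),
        # 2D consistency: mean Euclidean distance between landmark sets.
        "lmk": float(np.mean(np.linalg.norm(pred["lmk"] - target["lmk"], axis=-1))),
    }
    terms["total"] = sum(w[name] * terms[name] for name in w)
    return terms


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = {
        "expr": rng.normal(size=50),   # assumed expression dimension
        "va": np.array([0.2, -0.1]),
        "logits": rng.normal(size=8),  # assumed number of emotion classes
        "lmk": rng.normal(size=(68, 2)),
    }
    target = {
        "expr": np.zeros(50),
        "va": np.array([0.3, 0.0]),
        "label": 3,
        "lmk": np.zeros((68, 2)),
    }
    for name, value in hybrid_loss(pred, target).items():
        print(f"{name}: {value:.4f}")
```

In a sketch like this, lowering the weight on the emotion terms relative to the expression term is one plausible way to discourage exaggerated expressions, since the 3D term anchors the prediction to measured geometry.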
