Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features
Masoumeh Chapariniya, Pierre Vuillecard, Jean-Marc Odobez, Volker Dellwo, Teodora Vukovic
TL;DR
This paper tackles the problem of identifying individuals from dynamic facial expressions independent of static appearance. It introduces a disentangled pipeline based on the FLAME 3D morphable model to extract frame-wise expression and jaw dynamics while discarding shape, and uses a Conformer temporal encoder trained with supervised contrastive learning to classify across 1,429 speakers. A novel Drift-to-Noise Ratio (DNR) is proposed to quantify disentanglement quality by comparing inter-session drift in estimated shape to within-session noise, linking disentanglement reliability to recognition performance. The results on the CANDOR dataset show that pure facial dynamics encode strong identity signals, with the Conformer achieving around 60–61% accuracy in a 1,429-way task, and longer temporal context and more training data further boosting cross-session robustness. These findings have implications for social perception, clinical assessment, and personalized human–computer interaction, while also highlighting challenges in cross-session stability and the potential for eye gaze signals to enhance identity capture.
Abstract
This work investigates whether individuals can be identified solely through the dynamic components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14% accuracy on 1,429-way classification (458 times above chance), demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape-expression separation by measuring across-session shape changes relative to within-session variability. DNR correlates strongly and negatively with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.
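To make the idea behind the DNR concrete, the following is a minimal sketch rather than the exact definition used in this work. It assumes that each session yields a list of per-frame estimated shape vectors, that drift is the Euclidean distance between the two session means, and that noise is the average root-mean-square deviation of frames from their own session mean. The function names and the `eps` parameter are hypothetical and chosen for illustration only.

```python
import math


def mean_vector(frames):
    """Component-wise mean of a list of equal-length shape vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def within_session_noise(frames):
    """RMS Euclidean distance of per-frame shape vectors from the session mean."""
    mu = mean_vector(frames)
    squared = [sum((x - m) ** 2 for x, m in zip(f, mu)) for f in frames]
    return math.sqrt(sum(squared) / len(frames))


def drift_to_noise_ratio(session_a, session_b, eps=1e-12):
    """Illustrative DNR (assumed form): across-session drift over within-session noise.

    drift: distance between the mean shape vectors of the two sessions.
    noise: average of the two within-session RMS deviations.
    eps:   small constant guarding against division by zero (an assumption).
    """
    drift = math.dist(mean_vector(session_a), mean_vector(session_b))
    noise = 0.5 * (within_session_noise(session_a) + within_session_noise(session_b))
    return drift / (noise + eps)


# Toy example with 2-dimensional "shape" vectors and two frames per session.
session_a = [[0.0, 0.0], [2.0, 0.0]]
session_b = [[4.0, 4.0], [6.0, 4.0]]
dnr = drift_to_noise_ratio(session_a, session_b)
```

Under these assumptions, a low value indicates that the estimated shape of a speaker changes little between sessions compared with its frame-to-frame fluctuation, while a high value signals that the shape estimate itself has drifted and that the shape-expression separation is less reliable for that speaker.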
