Investigating Identity Signals in Conversational Facial Dynamics via Disentangled Expression Features

Masoumeh Chapariniya, Pierre Vuillecard, Jean-Marc Odobez, Volker Dellwo, Teodora Vukovic

TL;DR

This paper tackles the problem of identifying individuals from dynamic facial expressions independent of static appearance. It introduces a disentangled pipeline based on the FLAME 3D morphable model to extract frame-wise expression and jaw dynamics while discarding shape, and uses a Conformer temporal encoder trained with supervised contrastive learning to classify across 1,429 speakers. A novel Drift-to-Noise Ratio (DNR) is proposed to quantify disentanglement quality by comparing inter-session drift in estimated shape to within-session noise, linking disentanglement reliability to recognition performance. The results on the CANDOR dataset show that pure facial dynamics encode strong identity signals, with the Conformer achieving around 60–61% accuracy in a 1,429-way task, and longer temporal context and more training data further boosting cross-session robustness. These findings have implications for social perception, clinical assessment, and personalized human–computer interaction, while also highlighting challenges in cross-session stability and the potential for eye gaze signals to enhance identity capture.

Abstract

This work investigates whether individuals can be identified solely through the pure dynamical components of their facial expressions, independent of static facial appearance. We leverage the FLAME 3D morphable model to achieve explicit disentanglement between facial shape and expression dynamics, extracting frame-by-frame parameters from conversational videos while retaining only expression and jaw coefficients. On the CANDOR dataset of 1,429 speakers in naturalistic conversations, our Conformer model with supervised contrastive learning achieves 61.14% accuracy on 1,429-way classification (458 times above chance), demonstrating that facial dynamics carry strong identity signatures. We introduce a drift-to-noise ratio (DNR) that quantifies the reliability of shape-expression separation by measuring across-session shape changes relative to within-session variability. DNR strongly negatively correlates with recognition performance, confirming that unstable shape estimation compromises dynamic identification. Our findings reveal person-specific signatures in conversational facial dynamics, with implications for social perception and clinical assessment.
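
To make the "retain only expression and jaw coefficients" step concrete, here is a minimal sketch of how a per-utterance feature matrix could be assembled from per-frame FLAME parameters. The dictionary keys, the `dynamics_features` helper, and the split of the 103-dimensional frame vector (given in the Figure 1 caption) into 100 expression and 3 jaw values are assumptions for illustration, not details confirmed by the text above.

```python
# Assumed dimensions: 100 expression + 3 jaw = 103 values per frame.
# The total of 103 comes from the paper; the 100/3 split is an assumption.
EXPR_DIM = 100
JAW_DIM = 3


def dynamics_features(frames):
    """Build a T x 103 matrix (list of rows) from per-frame parameters.

    `frames` is a list of dicts with hypothetical keys "shape",
    "expression" and "jaw". The shape coefficients are deliberately
    ignored so that only dynamics reach the temporal encoder.
    """
    rows = []
    for index, frame in enumerate(frames):
        expression = list(frame["expression"])
        jaw = list(frame["jaw"])
        if len(expression) != EXPR_DIM or len(jaw) != JAW_DIM:
            raise ValueError(f"unexpected parameter size in frame {index}")
        # Concatenate expression and jaw; "shape" is never read.
        rows.append(expression + jaw)
    return rows


# Usage: five synthetic frames stand in for one short utterance.
example_frames = [
    {
        "shape": [0.5] * 300,  # present in the input, discarded below
        "expression": [0.01 * t] * EXPR_DIM,
        "jaw": [0.1, 0.0, 0.0],
    }
    for t in range(5)
]
X = dynamics_features(example_frames)
```

In this sketch `X` has one row per frame and 103 columns, matching the shape of the matrix that the temporal encoder consumes.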

Paper Structure

This paper contains 24 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Dynamics-only identification pipeline. A video utterance is processed by a frozen front end (VGGHeads + FLAME) to obtain per-frame parameters. We retain only expression $\psi$ and jaw $\boldsymbol{\theta}_{j}$ to form $\mathbf{X}\in\mathbb{R}^{T\times 103}$, which a temporal encoder maps to an identity prediction.
  • Figure 2: Performance predictors. (left) DNR vs. recall: identities are binned by drift-to-noise ratio; line shows the mean per-person recall per bin, the shaded band is the 95% CI, and bars denote the number of persons. Higher DNR corresponds to lower recall. (middle) Accuracy vs. length: per-utterance accuracy improves with longer clips; GA (same-session) remains above GB (cross-session) at all lengths. (right) Accuracy vs. training utterances: per-person accuracy increases with more training clips, with the largest gains for GB.
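
The drift-to-noise ratio used in Figure 2 compares across-session change in estimated shape to within-session variability. The paper's exact equations are not reproduced on this page, so the following is only one plausible formulation under stated assumptions: drift is the Euclidean distance between the two session-mean shape vectors, and noise is the average root-mean-square spread of frames around their own session mean. The function names are hypothetical.

```python
import math


def mean_vector(frames):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def rms_spread(frames, center):
    """Root-mean-square Euclidean distance of frames from `center`."""
    squared = [
        sum((value - c) ** 2 for value, c in zip(frame, center))
        for frame in frames
    ]
    return math.sqrt(sum(squared) / len(squared))


def drift_to_noise(session_a, session_b, eps=1e-12):
    """Assumed DNR: distance between session means over mean spread.

    Each session is a list of per-frame shape vectors for one person.
    `eps` guards against division by zero for constant estimates.
    """
    mean_a = mean_vector(session_a)
    mean_b = mean_vector(session_b)
    drift = math.dist(mean_a, mean_b)
    noise = 0.5 * (rms_spread(session_a, mean_a) + rms_spread(session_b, mean_b))
    return drift / (noise + eps)


# Usage with toy 2-D "shape" vectors: the session means are 4 apart and
# each session has an RMS spread of 1, so the ratio is close to 4.
session_a = [[0.0, 0.0], [2.0, 0.0]]
session_b = [[4.0, 0.0], [6.0, 0.0]]
dnr = drift_to_noise(session_a, session_b)
```

Under this formulation a large value means the estimated shape moved between sessions by more than it fluctuates within a session, which is the situation Figure 2 associates with lower recall.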