Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild

Xingjian Wang, Li Chai

TL;DR

This work tackles the problem that, in in-the-wild DFER, emotion-related facial dynamics are obscured by emotion-irrelevant content. It introduces IFDD, an implicit framework based on the wavelet lifting scheme that disentangles dynamic emotion cues from global context in two stages: the Inter-frame Static-dynamic Splitting Module (ISSM) and the Lifting-based Aggregation-Disentanglement Module (LADM). An explicit disentanglement loss combines task supervision with a global-context constraint to promote separation of dynamics from context. Across three challenging datasets, IFDD with CNN and ViT backbones achieves state-of-the-art or near-state-of-the-art performance with modest computational overhead, and shows robustness to noisy frames and improved per-emotion discrimination. The approach offers a versatile, backbone-agnostic paradigm for dynamic facial expression analysis, with potential extensions to other video tasks.

Abstract

In-the-wild dynamic facial expression recognition (DFER) faces a significant challenge: emotion-related expressions are often temporally and spatially diluted by emotion-irrelevant expressions and global context. Most prior DFER methods directly use coupled spatiotemporal representations, which may incorporate weakly relevant features carrying emotion-irrelevant context bias. Several DFER methods do highlight dynamic information, but they rely on explicit guidance that may be vulnerable to irrelevant motion. In this paper, we propose a novel Implicit Facial Dynamics Disentanglement framework (IFDD). By extending the wavelet lifting scheme to a fully learnable framework, IFDD disentangles emotion-related dynamic information from emotion-irrelevant global context in an implicit manner, i.e., without explicit operations or external guidance. The disentanglement process contains two stages. The first is the Inter-frame Static-dynamic Splitting Module (ISSM) for rough disentanglement estimation, which explores inter-frame correlation to generate content-aware splitting indexes on the fly. We use these indexes to split frame features into two groups, one with greater global similarity and the other with more unique dynamic features. The second stage is the Lifting-based Aggregation-Disentanglement Module (LADM) for further refinement. LADM first aggregates the two groups of features from ISSM with an updater to obtain fine-grained global context features, and then uses a predictor to disentangle emotion-related facial dynamic features from the global context. Extensive experiments on in-the-wild datasets demonstrate that IFDD outperforms prior supervised DFER methods with higher recognition accuracy and comparable efficiency. Code is available at https://github.com/CyberPegasus/IFDD.
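
The split, update, and predict steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: `content_aware_split`, `lifting_step`, and `inverse_lifting_step` are hypothetical names, similarity to the temporal mean is an assumed stand-in for the learned inter-frame correlation in ISSM, and the fixed `update` and `predict` functions are placeholders for the learnable updater and predictor in LADM. The sketch also assumes an even number of frames so that both groups have the same shape.

```python
import numpy as np


def content_aware_split(x):
    """Split frame features x of shape (T, C) into 'static' and 'dynamic' index sets.

    Assumption: cosine similarity to the clip-level mean feature stands in
    for the learned inter-frame correlation described for ISSM.
    """
    mean = x.mean(axis=0, keepdims=True)                      # (1, C)
    sim = (x * mean).sum(axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(mean) + 1e-8
    )                                                         # (T,)
    order = np.argsort(-sim)                                  # most similar frames first
    half = x.shape[0] // 2
    static_idx = np.sort(order[:half])                        # keep temporal order
    dynamic_idx = np.sort(order[half:])
    return static_idx, dynamic_idx


def lifting_step(x_s, x_d, update, predict):
    """One lifting step in the order given in the abstract: update, then predict."""
    y_s = x_s + update(x_d)      # aggregate both groups into global context
    y_d = x_d - predict(y_s)     # remove the context to leave the dynamics
    return y_s, y_d


def inverse_lifting_step(y_s, y_d, update, predict):
    """Undo lifting_step exactly by reversing the two steps."""
    x_d = y_d + predict(y_s)
    x_s = y_s - update(x_d)
    return x_s, x_d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))             # 8 frames, 16-dim features (toy data)

    static_idx, dynamic_idx = content_aware_split(x)

    # Placeholder operators; in a learnable framework these would be trained modules.
    update = lambda d: d
    predict = lambda s: 0.5 * s

    y_s, y_d = lifting_step(x[static_idx], x[dynamic_idx], update, predict)
    x_s, x_d = inverse_lifting_step(y_s, y_d, update, predict)
    print(np.allclose(x_s, x[static_idx]), np.allclose(x_d, x[dynamic_idx]))
```

The round trip at the end shows a general property of the classical lifting scheme: the step can be inverted for any choice of `update` and `predict`. The abstract does not state whether IFDD relies on this property.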

Paper Structure

This paper contains 34 sections, 9 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the IFDD framework, which consists of four main parts: (1) a multiscale backbone followed by pyramid aggregation; (2) the Inter-frame Static-dynamic Splitting Module (ISSM); (3) the Lifting-based Aggregation-Disentanglement Module (LADM); (4) a recognition head with decoupling loss. Starting from the spatiotemporal features extracted by the backbone, ISSM and LADM further decouple emotion-related dynamic features from emotion-irrelevant global context.
  • Figure 2: Visualization of the gradient attention of $\{Y_D,Y_S\}$ by Grad-CAM (left subimage) and of the distribution of classification features by t-SNE (right subimage). In the left subimage, clips and attention heatmaps are shown in different columns, and basic emotions are shown in different rows. Details can be found in the ablation study and the extended version.
  • Figure 3: Schematic diagram of different ISSM variants. $T$, $H$, and $W$ are the temporal and spatial sizes of the input feature $\textbf{x}$, respectively. $I$ denotes linear interpolation and $\otimes$ denotes element-wise multiplication.
  • Figure 4: Visual comparison with baselines and prior methods for per-class accuracy on the DFEW test set. The enclosed area of each radar chart represents the corresponding average class accuracy, i.e., UAR.
  • Figure 5: Distribution of predicted per-class confidence on the DFEW test set (1st fold) for vanilla baselines and IFDD variants. Sample points of different emotions and their kernel density estimates are shown in different columns. Positive samples are shown in orange and negative samples in green.