Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training

Dingkang Yang; Kun Yang; Haopeng Kuang; Zhaoyu Chen; Yuzheng Wang; Lihua Zhang

TL;DR

This work tackles context bias in Context-Aware Emotion Recognition (CAER) by framing CAER as a causal inference problem and introducing CCIM, a plug-and-play module that approximates the causal effect via backdoor adjustment under the do-operator $P(Y|do(X))$. The method constructs a confounder dictionary $\mathbf{Z}$ from masked contexts and employs NWGM-based approximation with attention over context prototypes to de-confound training. Empirical results on EMOTIC, CAER-S, and GroupWalk show consistent gains over strong baselines across discrete and continuous emotion measures, with ablations validating the components of CCIM. The approach advances unbiased emotion understanding in uncontrolled environments and offers a general framework for debiasing context-driven tasks in vision and multimodal learning.
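For orientation, the quantities named in this summary can be written out in their standard textbook form. This is a sketch under the assumption that the confounder $Z$ ranges over a finite dictionary of context prototypes $z$ and satisfies the backdoor criterion; the symbol $f$ for the pre-softmax scoring function is introduced here for illustration, and the paper's own equations may use different notation.

```latex
\begin{align}
  % Conventional likelihood: the context prior is conditioned on X,
  % so context bias leaks into the prediction.
  P(Y \mid X) &= \sum_{z} P(Y \mid X, z)\, P(z \mid X), \\
  % Backdoor adjustment: each context z is weighted by its prior P(z).
  P(Y \mid do(X)) &= \sum_{z} P(Y \mid X, z)\, P(z), \\
  % NWGM approximation: move the expectation inside the softmax.
  P(Y \mid do(X)) = \mathbb{E}_{z}\big[\mathrm{softmax}(f(X, z))\big]
    &\approx \mathrm{softmax}\big(\mathbb{E}_{z}[f(X, z)]\big).
\end{align}
```

The only difference between the first two lines is $P(z \mid X)$ versus $P(z)$: intervening on $X$ cuts the dependence of the context on the input, so every context stratum contributes according to its overall frequency.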

Abstract

Understanding emotions from diverse contexts has received widespread attention in computer vision communities. The core philosophy of Context-Aware Emotion Recognition (CAER) is to provide valuable semantic cues for recognizing the emotions of target persons by leveraging rich contextual information. Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts. Nevertheless, a long-neglected dilemma is that a severe context bias in existing datasets results in an unbalanced distribution of emotional states among different contexts, causing biased visual representation learning. From a causal demystification perspective, the harmful bias is identified as a confounder that misleads existing models to learn spurious correlations based on likelihood estimation, limiting the models' performance. To address the issue, we embrace causal inference to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a customized causal graph. Subsequently, we present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder, which is built upon backdoor adjustment theory to facilitate seeking approximate causal effects during model training. As a plug-and-play component, CCIM can easily integrate with existing approaches and bring significant improvements. Systematic experiments on three datasets demonstrate the effectiveness of our CCIM.
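The gap between likelihood estimation and causal intervention described in the abstract can be illustrated with a small numeric example. The sketch below is not taken from the paper: the two context categories and all probabilities are hypothetical values chosen for illustration. It compares the likelihood, which weights each context by $P(z \mid x)$, with the backdoor-adjusted estimate, which weights each context by its prior $P(z)$.

```python
# Hypothetical toy numbers, chosen only to illustrate context bias.
P_Z = {"vegetated": 0.5, "indoor": 0.5}           # prior P(z)
P_Z_GIVEN_X = {"vegetated": 0.9, "indoor": 0.1}   # biased P(z | x)
P_Y_GIVEN_XZ = {"vegetated": 0.8, "indoor": 0.4}  # P(y = positive | x, z)


def likelihood(p_y_given_xz, p_z_given_x):
    """P(y | x): sum over z of P(y | x, z) * P(z | x)."""
    return sum(p_y_given_xz[z] * p_z_given_x[z] for z in p_y_given_xz)


def intervention(p_y_given_xz, p_z):
    """P(y | do(x)): sum over z of P(y | x, z) * P(z) (backdoor adjustment)."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_y_given_xz)


p_obs = likelihood(P_Y_GIVEN_XZ, P_Z_GIVEN_X)  # 0.9 * 0.8 + 0.1 * 0.4
p_do = intervention(P_Y_GIVEN_XZ, P_Z)         # 0.5 * 0.8 + 0.5 * 0.4
print(f"P(y|x) = {p_obs:.2f}, P(y|do(x)) = {p_do:.2f}")
```

With these numbers the likelihood estimate is inflated by the over-represented vegetated context, while the adjusted estimate treats both contexts according to their prior.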

Paper Structure (35 sections, 9 equations, 12 figures, 7 tables)

Introduction
Related Work
Uni/Multimodal Emotion Recognition
Context-Aware Emotion Recognition
Causal Demystification
Methodology
Causal Perspective at CAER Task
Causal Intervention via Backdoor Adjustment
Context-Deconfounded Training with CCIM
Confounder Dictionary
Parameterization of the Proposed CCIM
Datasets and Evaluation Metrics
Implementation Details
Model Zoo
Confounder Construction
...and 20 more sections

Figures (12)

Figure 1: We provide several examples of emotion recognition in non-controlled scenarios. The red bounding boxes include the recognized subjects. (a) shows the ideal case of subject-centered emotion recognition, where previous efforts have extracted emotion-related semantics from available face, posture, and gesture information. (b) shows the common dilemma in the wild environment, where the subject's bodily regions are usually indistinguishable. In (c), it is difficult to recognize the emotion of the vague subject where the surrounding context is obscured. (d) shows complementary cues from the visible context around the subject that may reflect emotion, which is localized by the green bounding boxes.
Figure 2: The harmful context bias in the CAER task is illustrated with randomly selected samples from the training and testing sets of the EMOTIC dataset. GT denotes the ground truth of each sample. Most training samples with vegetated surrounding contexts share similar positive emotion categories. In this case, the model of [kosti2019context] relies on spurious correlations between specific contexts and emotion categories, learns misleading visual representations, and makes entirely incorrect predictions. With the proposed CCIM, the model corrects these prediction errors and gives more accurate results.
Figure 3: We present a preliminary toy experiment on the EMOTIC [kosti2019context] and CAER-S [lee2019context] datasets, focusing on scene categories associated with the emotions fear, anger, and happiness. A larger number of scene categories with zero normalized conditional entropy indicates a more pronounced harmful context bias.
Figure 4: Illustration of our CAER causal graph. (a) The conventional likelihood $P(\bm{Y}|\bm{X})$. (b) The causal intervention $P(\bm{Y}|do(\bm{X}))$.
Figure 5: A general pipeline for context-deconfounded training in the CAER task, adaptable to most CAER models. Given an input image $\bm{x}$, two generalized coding functions $f_{s}(\cdot)$ and $f_{c}(\cdot)$ extract the subject feature $\bm{s}$ and the context feature $\bm{c}$ from different regions, respectively. Subsequently, $\bm{s}$ and $\bm{c}$ are fused into the joint representation $\bm{h}$; the specific fusion strategy depends on the method. The red dotted box marks the core component, the proposed CCIM, which is inserted before the task-specific classifier to approximate the causal intervention and help the model seek the true causal effect during training.
...and 7 more figures
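
The step that the Figure 5 caption places before the classifier can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names (`attention_weights`, `ccim_sketch`), the scaled dot-product attention, the replacement of any learned projections by identity maps, and all toy values are assumptions. It takes a joint representation `h`, attends over a dictionary of context prototypes, weights each prototype by its prior, and adds the result to `h`, in the spirit of the NWGM-based approximation with attention over context prototypes mentioned in the TL;DR.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def attention_weights(h, prototypes):
    """Scaled dot-product attention of the joint feature h over prototypes."""
    scale = math.sqrt(len(h))
    return softmax([dot(h, z) / scale for z in prototypes])


def ccim_sketch(h, prototypes, priors):
    """Return h plus the attention- and prior-weighted sum of prototypes.

    Assumption: learned projection matrices are replaced by identity maps
    so that the example stays small and dependency-free.
    """
    weights = attention_weights(h, prototypes)
    expected_z = [0.0] * len(h)
    for lam, p_z, z in zip(weights, priors, prototypes):
        for i, value in enumerate(z):
            expected_z[i] += lam * p_z * value
    return [h_i + e_i for h_i, e_i in zip(h, expected_z)]


# Hypothetical toy values: a 2-d joint feature and two context prototypes.
h = [1.0, 0.0]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
priors = [0.5, 0.5]
deconfounded = ccim_sketch(h, prototypes, priors)
print(deconfounded)
```

In a trainable version the output would be passed to the task-specific classifier in place of `h`, and the identity maps would become learned linear layers; the prototypes would come from a precomputed confounder dictionary rather than hand-written vectors.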
