Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning

Qing Zhu; Qirong Mao; Jialin Zhang; Xiaohua Huang; Wenming Zheng

TL;DR

This work tackles group-level emotion recognition (GER) in unconstrained scenes by explicitly modeling uncertainty. It introduces Uncertainty-Aware Learning (UAL), which maps each individual to a Gaussian in latent space, $P(z_n \mid x_n^I)=\mathcal{N}(z_n;\mu_n,\sigma_n^2 I)$, and uses Monte Carlo sampling to generate diverse, uncertainty-informed representations, $z_n^* = \frac{1}{M}\sum_{m=1}^M(\mu_n+\epsilon_m\sigma_n)$. The model comprises three branches (face, object, scene) with an image enhancement module, and uses a Proportional-Weighted Fusion Strategy (PWFS) to fuse branch predictions based on uncertainty-derived weights. Key contributions include uncertainty-sensitive scores for adaptive fusion, KL/rank/rec loss terms to stabilize training, a reconstruction-like penalty to curb variance oscillations, and extensive experiments on GAFF2, GAFF3, and MultiEmoVA demonstrating improved robustness and generalization. The approach advances GER by enabling robust, diverse representations under real-world uncertainties, with implications for reliable affective AI in crowded or noisy environments.
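
The sampling step $z_n^* = \frac{1}{M}\sum_{m=1}^M(\mu_n+\epsilon_m\sigma_n)$ can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes $\epsilon_m$ is drawn independently per dimension from a standard normal (the usual reparameterization), and the names `sample_embedding`, `mu`, `sigma`, and `num_samples` are hypothetical.

```python
import random


def sample_embedding(mu, sigma, num_samples, rng):
    """Average M reparameterized draws mu + eps * sigma.

    Assumption: eps ~ N(0, 1) independently for every dimension and draw.
    """
    dim = len(mu)
    z_sum = [0.0] * dim
    for _ in range(num_samples):
        for d in range(dim):
            eps = rng.gauss(0.0, 1.0)
            z_sum[d] += mu[d] + eps * sigma[d]
    return [v / num_samples for v in z_sum]


rng = random.Random(0)          # fixed seed for repeatability
mu = [0.5, -1.0, 2.0]           # hypothetical per-individual mean
sigma = [0.1, 0.3, 0.0]         # hypothetical per-dimension std. deviation
z_star = sample_embedding(mu, sigma, num_samples=1000, rng=rng)
```

With a larger `num_samples`, `z_star` concentrates around `mu`; a dimension with zero `sigma` reproduces its mean exactly, which is the deterministic point-embedding case.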

Abstract

Group-level emotion recognition (GER) is an inseparable part of human behavior analysis, aiming to recognize the overall emotion in a multi-person scene. However, existing methods focus on combining diverse emotion cues while ignoring the inherent uncertainties of unconstrained environments, such as congestion and occlusion within a group. Additionally, since only group-level labels are available, inconsistent emotion predictions among individuals in one group can confuse the network. In this paper, we propose an uncertainty-aware learning (UAL) method to extract more robust representations for GER. By explicitly modeling the uncertainty of each individual, we use a stochastic embedding drawn from a Gaussian distribution instead of a deterministic point embedding. This representation captures the probabilities of different emotions and, through its stochasticity, generates diverse predictions during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' faces within each group. Moreover, we develop an image enhancement module to improve the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene components, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.
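
To make the idea of uncertainty-sensitive fusion weights concrete, the sketch below weights each face embedding by how certain it is. The abstract does not give the scoring function, so this example assumes a simple inverse-variance score normalized to sum to one; that choice, and the names `fuse_by_uncertainty`, `embeddings`, and `variances`, are illustrative assumptions rather than the paper's formulation.

```python
def fuse_by_uncertainty(embeddings, variances, eps=1e-8):
    """Fuse per-individual embeddings with weights that shrink as variance grows.

    Assumption: the score of an individual is the inverse of its mean
    variance; the paper's actual uncertainty-sensitive score may differ.
    """
    scores = [1.0 / (sum(v) / len(v) + eps) for v in variances]
    total = sum(scores)
    weights = [s / total for s in scores]
    dim = len(embeddings[0])
    fused = [
        sum(w * e[d] for w, e in zip(weights, embeddings))
        for d in range(dim)
    ]
    return weights, fused


# Two hypothetical faces: the second is four times more uncertain.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
variances = [[0.1, 0.1], [0.4, 0.4]]
weights, fused = fuse_by_uncertainty(embeddings, variances)
```

Here the occluded or ambiguous face (higher variance) contributes less to the group-level representation, which is the behavior the abstract describes.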

Paper Structure (16 sections, 15 equations, 6 figures, 9 tables)

Introduction
Related Work
Group-level Emotion Recognition
Learning with Uncertainties
Proposed method
Feature Extractor
Uncertainty Modeling
Image Enhancement Module
Proportional-weighted Fusion Strategy
Experimental and Discussion
Databases and Evaluation Metrics
Implementation Details
Comparison with the State-of-the-Art (SOTA) methods
Ablation Study
Visualization Analysis
...and 1 more sections

Figures (6)

Figure 1: Observation and Motivation: Low-quality examples in the GER database contain varying degrees of uncertain information. In (a), (b), and (c), a face is partially obscured due to being blocked by another individual in the same group, while in (b) and (d), faces experience self-occlusion. Robust emotion representations are necessary to assign lower weights to these face samples. Emotion predictions for individuals are ambiguous in both (a) and (d), and emotions of objects with the same semantic information vary in (a) and (c). These factors significantly impact the performance of GER.
Figure 2: Overview of the proposed method. (a) The framework, which incorporates face, object, and scene branches for GER and integrates the UAL module into the face and object branches. Notably, the face branch includes an image enhancement module. The proportional-weighted fusion combines the outputs of the three branches to provide the final group-level prediction. (b-d) The UAL module. Uncertainty embedding (UE) in (b) represents each individual using a stochastic embedding rather than the conventional point embedding. (c) and (d) show uncertainty modeling with UE incorporated into the face and object branches, respectively.
Figure 3: Confusion matrices on the MultiEmoVA dataset. The values reflect classification accuracy for every category. "HighPos", "MedPos", "HighNeg", "MedNeg", and "Neu" are abbreviations for "High-Positive", "Medium-Positive", "High-Negative", "Medium-Negative", and "Neutral" in the group emotion labels for the MultiEmoVA dataset, respectively.
Figure 4: Impact of the total number of samples $M$ on the GAFF2 database.
Figure 5: Illustration of the face quality scores that the image enhancement module assigns to individual faces within a group. The threshold $\delta_{2}$ in Eq. \ref{eqn:FIQE} is set to 0.3, meaning that individual face samples with scores below 0.3 are discarded by the image enhancement module.
...and 1 more figures
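
The thresholding described in the Figure 5 caption amounts to a simple filter. The snippet below is a sketch under stated assumptions: the function name `keep_reliable_faces` and the example scores are hypothetical, and only the threshold value 0.3 and the rule "scores below the threshold are discarded" come from the caption.

```python
def keep_reliable_faces(quality_scores, threshold=0.3):
    """Return indices of faces whose quality score is not below the threshold."""
    return [i for i, score in enumerate(quality_scores) if score >= threshold]


# Hypothetical quality scores for four faces in one group image.
quality_scores = [0.9, 0.25, 0.3, 0.1]
kept = keep_reliable_faces(quality_scores)
```

A face scoring exactly 0.3 is kept here, since the caption only discards scores strictly less than the threshold.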
