Temporal Label Hierarchical Network for Compound Emotion Recognition
Sunan Li, Hailun Lian, Cheng Lu, Yan Zhao, Tianhua Qi, Hao Yang, Yuan Zong, Wenming Zheng
TL;DR
The work tackles compound emotion recognition in unconstrained environments where basic emotion categories are insufficient. It introduces a temporal pyramid network that combines a ResNet18 backbone with a Transformer to extract spatiotemporal features from three parallel 15-frame sequences, enabling robust frame-level predictions. An auxiliary valence/arousal classifier trained on the DFEW dataset guides a coarse-to-fine labeling strategy to map to compound emotions and mitigate data imbalance. Evaluated on ABAW7, the approach achieves promising average F1 across seven compound expressions, demonstrating the effectiveness of multi-scale temporal aggregation and hierarchical labeling for real-world emotion recognition.
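The multi-scale sampling idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pyramid_indices`, the strides `(1, 2, 4)`, centring on the target frame, and clamping at the video boundaries are all assumptions, since the text only states that three parallel 15-frame sequences are used.

```python
from typing import List, Sequence


def pyramid_indices(
    center: int,
    num_frames: int,
    seq_len: int = 15,                   # 15 frames per sequence, as stated in the text
    strides: Sequence[int] = (1, 2, 4),  # hypothetical strides, not given in the source
) -> List[List[int]]:
    """Return one list of frame indices per pyramid level, centred on `center`.

    Assumes `seq_len` is odd so the window is symmetric around the target frame.
    """
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    half = seq_len // 2
    levels = []
    for stride in strides:
        # Offsets -half..+half are scaled by the stride, then clamped to the
        # video bounds so frames near the start or end are padded by repetition.
        level = [
            min(max(center + offset * stride, 0), num_frames - 1)
            for offset in range(-half, half + 1)
        ]
        levels.append(level)
    return levels


if __name__ == "__main__":
    for stride, idx in zip((1, 2, 4), pyramid_indices(center=50, num_frames=200)):
        print(f"stride {stride}: frames {idx[0]}..{idx[-1]} ({len(idx)} frames)")
```

Each level would then be passed through the shared backbone and Transformer, with the coarser levels covering a longer time span around the same target frame.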
Abstract
Emotion recognition has attracted increasing attention in recent decades. Although recognition of the seven basic emotions has progressed significantly, existing methods still struggle with compound emotion recognition, which occurs commonly in practical applications. This article presents our work for the 7th Affective Behavior Analysis in-the-wild (ABAW) competition. For the competition, we adopted a pre-trained ResNet18 and a Transformer, both widely validated, as the basic network framework. Considering the continuity of emotions over time, we propose a temporal pyramid network for frame-level emotion prediction. Furthermore, to address the lack of data for compound emotion recognition, we used fine-grained labels from the DFEW database to construct training data for the competition's emotion categories. Taking into account the valence-arousal characteristics of the various compound emotions, we constructed a coarse-to-fine classification framework in the label space.
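A coarse-to-fine decision in the label space might look like the sketch below. The grouping in `COARSE_TO_FINE`, the coarse group names, and the function `coarse_to_fine` are hypothetical and chosen only for illustration; the abstract does not specify how the valence-arousal classes map to compound emotions.

```python
from typing import Dict, List

# Hypothetical grouping of seven compound expressions by valence and arousal.
# The actual coarse classes used by the authors are not given in the abstract.
COARSE_TO_FINE: Dict[str, List[str]] = {
    "positive": ["Happily Surprised"],
    "negative_high_arousal": [
        "Fearfully Surprised",
        "Angrily Surprised",
        "Disgustedly Surprised",
    ],
    "negative_low_arousal": ["Sadly Surprised", "Sadly Fearful", "Sadly Angry"],
}


def coarse_to_fine(
    coarse_probs: Dict[str, float], fine_scores: Dict[str, float]
) -> str:
    """Pick a compound emotion in two stages.

    Stage 1 selects the most likely coarse group; stage 2 picks the
    highest-scoring compound emotion among that group's candidates only.
    """
    coarse = max(coarse_probs, key=lambda name: coarse_probs[name])
    candidates = COARSE_TO_FINE[coarse]
    # Classes missing from fine_scores are treated as the least likely.
    return max(candidates, key=lambda name: fine_scores.get(name, float("-inf")))


if __name__ == "__main__":
    coarse = {
        "positive": 0.1,
        "negative_high_arousal": 0.7,
        "negative_low_arousal": 0.2,
    }
    fine = {
        "Happily Surprised": 0.9,
        "Angrily Surprised": 0.4,
        "Fearfully Surprised": 0.3,
        "Sadly Angry": 0.5,
    }
    print(coarse_to_fine(coarse, fine))
```

In this example the fine classifier alone would favour "Happily Surprised", but the coarse stage restricts the candidates to the selected group, which is one way a hierarchical label space can counteract class imbalance.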
