HSEmotion Team at the 7th ABAW Challenge: Multi-Task Learning and Compound Facial Expression Recognition
Andrey V. Savchenko
TL;DR
The paper tackles robust, privacy-preserving facial emotion analysis for ABAW-7 with a lightweight, on-device pipeline built on frame-level features from multi-task pre-trained backbones (e.g., MT-EmotiEffNet, MT-EmotiDDAMFN, MT-EmotiMobileFaceNet, MT-EmotiMobileViT). A compact feed-forward head outputs $p_{VA}$, $p_{EXPR}$, and $p_{AU}$, with a slice layer ensuring that $V$ and $A$ use suitable inputs. The predictions of the top models are combined by simple blending and then smoothed in the time domain with box or Gaussian filters. On the validation set, the approach delivers substantial gains over the baseline, reaching $P_{MTL}$ of about $1.49$ and up to a $1.25\times$ improvement in VA CCC. For compound expression (CE) recognition, box-filter post-processing yields a test F1 of about $0.3146$. The results illustrate that lightweight architectures paired with pragmatic post-processing can rival heavier ensembles under privacy constraints. They support practical, mobile-friendly affective analysis in unconstrained environments and lay the groundwork for further gains from additional pre-trained backbones and multimodal cues, without sacrificing privacy.
Abstract
In this paper, we describe the results of the HSEmotion team in two tasks of the seventh Affective Behavior Analysis in-the-wild (ABAW) competition: multi-task learning for the simultaneous prediction of facial expression, valence, and arousal together with the detection of action units, and compound expression recognition. We propose an efficient pipeline based on frame-level facial feature extractors pre-trained in multi-task settings to estimate valence-arousal and basic facial expressions from a facial photo. We ensure the privacy-awareness of our techniques by using lightweight neural network architectures, such as MT-EmotiDDAMFN, MT-EmotiEffNet, and MT-EmotiMobileFaceNet, that can run even on a mobile device without sending facial video to a remote server. We demonstrate that smoothing the neural network output scores with Gaussian or box filters is a significant step in improving the overall accuracy. Experiments show that this simple post-processing, applied to predictions from a simple blending of the two top visual models, improves the F1-score of facial expression recognition by up to 7%. At the same time, the mean Concordance Correlation Coefficient (CCC) of valence and arousal increases by up to 1.25 times compared to each model's frame-level predictions. As a result, our final performance score on the validation set of the multi-task learning challenge is more than 4.5 times higher than the baseline (1.494 vs 0.32).
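The post-processing described above (blending two models' frame-level scores, then smoothing them over time) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the blending weight, the kernel sizes, and the synthetic signal are all assumptions, since the abstract does not state the settings actually used. The `ccc` helper uses the standard definition of the Concordance Correlation Coefficient, $2\,\mathrm{cov}(x, y) / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$.

```python
import numpy as np

def blend(scores_a, scores_b, weight=0.5):
    """Weighted average of two models' scores, each of shape (frames, outputs).
    The weight of 0.5 is an illustrative assumption."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return weight * a + (1.0 - weight) * b

def box_kernel(size):
    """Uniform (box) kernel; the size must be odd so the window is centered."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    return np.ones(size)

def gaussian_kernel(sigma, radius=None):
    """Sampled Gaussian kernel of length 2 * radius + 1."""
    if radius is None:
        radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    return np.exp(-(x ** 2) / (2.0 * sigma ** 2))

def smooth(scores, kernel):
    """Convolve every output column with a 1-D kernel along the time axis."""
    scores = np.asarray(scores, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kernel = kernel / kernel.sum()  # normalize so constant signals are preserved
    pad = len(kernel) // 2
    # Repeat edge frames so the first and last predictions are not pulled toward zero.
    padded = np.pad(scores, ((pad, pad), (0, 0)), mode="edge")
    columns = [np.convolve(padded[:, j], kernel, mode="valid")
               for j in range(scores.shape[1])]
    return np.stack(columns, axis=1)

def ccc(x, y):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2.0 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Synthetic demo: a slowly varying "valence" track and two noisy per-frame predictors.
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
model_a = (truth + rng.normal(0.0, 0.5, truth.shape))[:, None]
model_b = (truth + rng.normal(0.0, 0.5, truth.shape))[:, None]

blended = blend(model_a, model_b)
smoothed_box = smooth(blended, box_kernel(15))
smoothed_gauss = smooth(blended, gaussian_kernel(sigma=3.0))

print("CCC, single model :", ccc(truth, model_a[:, 0]))
print("CCC, blended      :", ccc(truth, blended[:, 0]))
print("CCC, box filter   :", ccc(truth, smoothed_box[:, 0]))
print("CCC, Gaussian     :", ccc(truth, smoothed_gauss[:, 0]))
```

The same `smooth` call applies unchanged to expression or action unit scores with several output columns. In practice the kernel width would be tuned per task on the validation set, because a window that is too wide blurs short-lived expressions.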
