HSEmotion Team at the 7th ABAW Challenge: Multi-Task Learning and Compound Facial Expression Recognition

Andrey V. Savchenko

TL;DR

The paper tackles robust, privacy-preserving facial emotion analysis for ABAW-7 by proposing a lightweight, on-device pipeline built around frame-level features from multi-task pre-trained backbones (e.g., MT-EmotiEffNet, MT-EmotiDDAMFN, MT-EmotiMobileFaceNet, MT-EmotiMobileViT). A compact feed-forward head outputs $p_{VA}$, $p_{EXPR}$, and $p_{AU}$, with a slice layer ensuring $V$ and $A$ use suitable inputs, and simple blending of the top models alongside time-domain post-processing using box or Gaussian filters. The approach delivers substantial gains over baselines on validation, achieving $P_{MTL}$ up to about $1.49$ and up to $1.25\times$ improvement in VA CCC, while CE recognition advances to a test F1 around $0.3146$ with box-filter post-processing, illustrating that lightweight architectures paired with pragmatic post-processing can rival heavier ensembles under privacy constraints. These results promote practical, mobile-friendly affective analysis in unconstrained environments and lay groundwork for further gains via additional pre-trained backbones and multimodal cues, without sacrificing privacy.

Abstract

In this paper, we describe the results of the HSEmotion team in two tasks of the seventh Affective Behavior Analysis in-the-wild (ABAW) competition, namely, multi-task learning for simultaneous prediction of facial expression, valence, arousal, and detection of action units, and compound expression recognition. We propose an efficient pipeline based on frame-level facial feature extractors pre-trained in multi-task settings to estimate valence-arousal and basic facial expressions given a facial photo. We ensure the privacy-awareness of our techniques by using lightweight neural network architectures, such as MT-EmotiDDAMFN, MT-EmotiEffNet, and MT-EmotiMobileFaceNet, that can run even on a mobile device without the need to send facial video to a remote server. We demonstrate that smoothing the neural network output scores with Gaussian or box filters is a significant step in improving the overall accuracy. Experiments show that this simple post-processing, applied to predictions from a simple blend of the two top visual models, improves the F1-score of facial expression recognition by up to 7%. At the same time, the mean Concordance Correlation Coefficient (CCC) of valence and arousal is increased by up to 1.25 times compared to each model's frame-level predictions. As a result, our final performance score on the validation set from the multi-task learning challenge is 4.5 times higher than the baseline (1.494 vs 0.32).
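
The post-processing described in the abstract (blending two models' frame-level scores, then smoothing them over time with a box or Gaussian filter) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the edge padding at video boundaries, the kernel sizes, and the equal blending weight are all assumptions made for the example. The `ccc` function follows the standard definition of the Concordance Correlation Coefficient.

```python
import numpy as np

def box_kernel(width):
    """Uniform (box) kernel; width must be odd so the output stays aligned."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    return np.ones(width) / width

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at about 3 standard deviations."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_scores(scores, kernel):
    """Smooth per-frame scores of shape (n_frames, n_outputs) along time.

    Edge frames are handled by repeating the first/last frame (an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    pad = len(kernel) // 2
    padded = np.pad(scores, ((pad, pad), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, j], kernel, mode="valid")
            for j in range(scores.shape[1])]
    return np.stack(cols, axis=1)

def blend(scores_a, scores_b, weight=0.5):
    """Weighted average of two models' scores (weight is hypothetical)."""
    return weight * np.asarray(scores_a) + (1.0 - weight) * np.asarray(scores_b)

def ccc(x, y):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)

# Synthetic example: a slowly varying "valence" track and two noisy predictors.
rng = np.random.default_rng(0)
t = np.arange(200)
truth = np.sin(2 * np.pi * t / 200)
pred_a = (truth + rng.normal(0, 0.5, t.size))[:, None]
pred_b = (truth + rng.normal(0, 0.5, t.size))[:, None]

blended = blend(pred_a, pred_b)
smoothed = smooth_scores(blended, box_kernel(9))

print("CCC, single model:      ", ccc(truth, pred_a[:, 0]))
print("CCC, blended:           ", ccc(truth, blended[:, 0]))
print("CCC, blended + smoothed:", ccc(truth, smoothed[:, 0]))
```

Swapping `box_kernel(9)` for `gaussian_kernel(sigma)` gives the Gaussian variant. On synthetic data like this, smoothing helps because the target changes slowly relative to the frame-level noise; the kernel size that works on real video would need to be tuned on a validation set.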

Paper Structure

This paper contains 8 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Proposed pipeline
  • Figure 2: Dependence of the average CCC of VA prediction on smoothing variance $\sigma^2$: (a) MT-EmotiDDAMFN, (b) MT-EmotiEffNet-B0.
  • Figure 3: Dependence of the F1-score of EXPR classification on smoothing variance $\sigma^2$: (a) MT-EmotiDDAMFN, (b) MT-EmotiEffNet-B0.
  • Figure 4: Dependence of the F1-score of AU detection on smoothing variance $\sigma^2$: (a) MT-EmotiDDAMFN, (b) MT-EmotiEffNet-B0.
  • Figure 5: Dependence of blending results on smoothing variance $\sigma^2$: (a) EXPR classification, (b) VA prediction.
  • ...and 2 more figures