V-NAW: Video-based Noise-aware Adaptive Weighting for Facial Expression Recognition

JunGyu Lee, Kunyoung Lee, Haesol Park, Ig-Jae Kim, Gi Pyo Nam

TL;DR

This paper addresses video-based facial expression recognition under label noise and class imbalance, leveraging a temporal Transformer and a novel frame-level augmentation strategy to reduce spatiotemporal redundancy. It introduces Video-based Noise-aware Adaptive Weighting (V-NAW), which weights clip-level predictions by a Gaussian-based uncertainty measure, and combines frame skipping with in-frame erasing to improve generalization. On Aff-Wild2, the approach significantly outperforms a strong baseline; ablations show additive gains from augmentation and NAW, and the results demonstrate that purely vision-based features can achieve competitive performance. The work highlights the importance of handling annotation ambiguity and redundancy in video FER and provides a practical, modality-light pathway for robust affective analysis.
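As a rough illustration of combining frame skipping with in-frame erasing, the following minimal sketch subsamples a clip at a fixed stride and zeroes one random rectangle in each remaining frame. The function name `augment_clip`, the parameters `stride` and `erase_frac`, and the choice to erase with zeros are all assumptions made for this example; the summary does not state the settings actually used.

```python
import numpy as np


def augment_clip(clip, stride=2, erase_frac=0.25, rng=None):
    """Frame skipping followed by in-frame erasing (illustrative only).

    clip: array of shape (T, H, W, C).
    stride: keep every `stride`-th frame (hypothetical setting).
    erase_frac: side length of the erased rectangle as a fraction of
        the frame height and width (hypothetical setting).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Frame skipping: drop near-duplicate consecutive frames.
    out = clip[::stride].copy()
    t, h, w, _ = out.shape

    # In-frame erasing: zero one random rectangle per kept frame.
    eh = max(1, int(h * erase_frac))
    ew = max(1, int(w * erase_frac))
    for i in range(t):
        top = rng.integers(0, h - eh + 1)
        left = rng.integers(0, w - ew + 1)
        out[i, top:top + eh, left:left + ew, :] = 0
    return out


if __name__ == "__main__":
    clip = np.ones((8, 16, 16, 3), dtype=np.float32)
    augmented = augment_clip(clip, rng=np.random.default_rng(0))
    print(augmented.shape)
```

Skipping frames shortens the clip and reduces redundancy between neighbours, while erasing a different region in each frame discourages the model from relying on any single facial area.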

Abstract

Facial Expression Recognition (FER) plays a crucial role in human affective analysis and has been widely applied in computer vision tasks such as human-computer interaction and psychological assessment. The 8th Affective Behavior Analysis in-the-Wild (ABAW) Challenge aims to assess human emotions using the video-based Aff-Wild2 dataset. This challenge includes various tasks, including the video-based EXPR recognition track, which is our primary focus. In this paper, we demonstrate that addressing label ambiguity and class imbalance, which are known to cause performance degradation, can lead to meaningful performance improvements. Specifically, we propose Video-based Noise-aware Adaptive Weighting (V-NAW), which adaptively assigns importance to each frame in a clip to address label ambiguity and effectively capture temporal variations in facial expressions. Furthermore, we introduce a simple and effective augmentation strategy to reduce redundancy between consecutive frames, which is a primary cause of overfitting. Through extensive experiments, we validate the effectiveness of our approach, demonstrating significant improvements in video-based FER performance.
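To make the idea of assigning a per-frame importance concrete, here is a minimal sketch of one possible Gaussian-based weighting. It is not the paper's formula, which this section does not give: the assumption here is that each frame's weight is a Gaussian function of the probability the model assigns to that frame's label, and that the weights rescale a per-frame cross-entropy loss. The names `gaussian_frame_weights`, `weighted_clip_loss`, `mu`, and `sigma` are hypothetical.

```python
import numpy as np


def gaussian_frame_weights(p_true, mu=1.0, sigma=0.5):
    """Assumed weighting: near 1 when p_true is close to mu, smaller otherwise."""
    p_true = np.asarray(p_true, dtype=float)
    return np.exp(-((p_true - mu) ** 2) / (2.0 * sigma ** 2))


def weighted_clip_loss(probs, labels, mu=1.0, sigma=0.5, eps=1e-12):
    """Weighted mean of per-frame cross-entropy over one clip.

    probs: (T, K) per-frame class probabilities.
    labels: (T,) integer class ids, possibly noisy.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    # Probability assigned to the annotated class of each frame.
    p_true = probs[np.arange(len(labels)), labels]

    ce = -np.log(p_true + eps)                    # per-frame cross-entropy
    w = gaussian_frame_weights(p_true, mu, sigma)  # per-frame importance
    return float(np.sum(w * ce) / (np.sum(w) + eps))


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    labels = [0, 1, 1]
    print(weighted_clip_loss(probs, labels))
```

Under these assumptions, frames whose annotated label the model finds implausible contribute less to the clip loss, which is one way to limit the influence of ambiguous or mislabeled frames.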

Paper Structure

This paper contains 17 sections, 15 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The overview of our proposed method. First, we apply augmentation to the input video clip. Next, we extract frame-wise visual features using a pre-trained image encoder (blue box). These features are then aggregated at the clip level and processed by a temporal encoder to capture temporal information (pink box). Finally, we incorporate Noise-aware Adaptive Weighting (NAW) NLA