Exploring Facial Expression Recognition through Semi-Supervised Pretraining and Temporal Modeling

Jun Yu, Zhihong Wei, Zhongpeng Cai, Gongpeng Zhao, Zerui Zhang, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu

TL;DR

A semi-supervised learning technique generates expression-category pseudo-labels for unlabeled face data, and a debiased feedback learning strategy addresses both the category imbalance in the dataset and the possible data bias in semi-supervised learning.

Abstract

Facial Expression Recognition (FER) plays a crucial role in computer vision and has applications across many fields. This paper presents our approach for the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition, scheduled to be held at CVPR 2024. In the facial expression recognition task, the limited size of the FER dataset restricts the generalization ability of the recognition model, resulting in subpar recognition performance. To address this problem, we employ a semi-supervised learning technique to generate expression-category pseudo-labels for unlabeled face data. At the same time, we uniformly sample the labeled facial expression data and implement a debiased feedback learning strategy to address both the category imbalance in the dataset and the possible data bias in semi-supervised learning. Moreover, to compensate for the limitations and bias of features obtained only from static images, we introduce a Temporal Encoder that learns the temporal relationships between neighbouring expression image features. In the 6th ABAW competition, our method achieved outstanding results on the official validation set, which supports the effectiveness and competitiveness of the proposed method.
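The pseudo-labeling and uniform sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence threshold of 0.95, the function names, and the toy logits are assumptions made for illustration, and the debiased feedback learning strategy is not modeled here because the abstract does not specify it.

```python
import math
import random
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits_batch, threshold=0.95):
    """Return (index, class) pairs for unlabeled samples whose top
    predicted probability reaches the (assumed) confidence threshold."""
    selected = []
    for i, logits in enumerate(logits_batch):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


def uniform_class_sample(labels, n, rng):
    """Draw n sample indices so that every class is equally likely,
    regardless of how many samples each class has."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    # Pick a class uniformly first, then a sample within that class.
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(n)]


# Hypothetical teacher outputs for three unlabeled faces (3 classes).
teacher_logits = [[8.0, 0.5, 0.1], [1.0, 1.2, 0.9], [0.0, 0.2, 9.0]]
pseudo = pseudo_label(teacher_logits, threshold=0.95)

# Hypothetical, heavily imbalanced labeled set: 90 / 8 / 2 samples.
labels = [0] * 90 + [1] * 8 + [2] * 2
batch = uniform_class_sample(labels, n=3000, rng=random.Random(0))
```

In this toy run only the first and third unlabeled samples are confident enough to receive a pseudo-label, and the sampled batch contains the three classes in roughly equal proportion even though the labeled set is heavily skewed.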

Paper Structure (21 sections, 8 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Framework description. Our approach is divided into a Spatial Pretraining phase and a Temporal Refine phase. (1) The Spatial Pretraining phase expands the facial expression data by mining large-scale unlabeled faces with a semi-supervised algorithm. (2) The Temporal Refine phase uses a temporal encoder to enhance the image features extracted by the student network in the first phase, improving the accuracy of recognizing dynamic facial expressions in video.
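The idea behind the Temporal Refine phase, letting each frame feature attend to its neighbours in the clip, can be sketched as below. This is a simplified stand-in, not the paper's Temporal Encoder: it assumes a single attention head with identity query/key/value projections and a residual connection, and the function name and toy features are hypothetical.

```python
import math


def temporal_self_attention(features):
    """Single-head scaled dot-product self-attention over a clip.

    features: list of T per-frame feature vectors (each a list of D floats).
    Queries, keys and values are the features themselves (identity
    projections), a simplification made for illustration.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    refined = []
    for q in features:
        # Similarity of this frame to every frame in the clip.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in features]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Attention-weighted average of all frame features (temporal context).
        context = [
            sum(w * v[d] for w, v in zip(weights, features)) for d in range(dim)
        ]
        # Residual connection: add the context to the original frame feature.
        refined.append([x + c for x, c in zip(q, context)])
    return refined


# Hypothetical 2-D features for three consecutive frames.
clip = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined = temporal_self_attention(clip)
```

Each refined feature is the original frame feature plus a convex combination of all frame features in the clip, so frames with similar features reinforce one another. A trained encoder would add learned projections, multiple heads, and positional information.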