Bridging the Gaps: Utilizing Unlabeled Face Recognition Datasets to Boost Semi-Supervised Facial Expression Recognition

Jie Song, Mengqiao He, Jinhua Feng, Bairong Shen

TL;DR

This work first performs face reconstruction pre-training on large-scale facial images without annotations to learn features of facial geometry and expression regions, followed by two-stage fine-tuning on FER datasets with limited labels to boost semi-supervised FER.

Abstract

In recent years, Facial Expression Recognition (FER) has gained increasing attention. Most current work focuses on supervised learning, which requires a large number of labeled and diverse images, while FER suffers from a scarcity of large, diverse datasets and from annotation difficulty. To address these problems, we focus on utilizing large unlabeled Face Recognition (FR) datasets to boost semi-supervised FER. Specifically, we first perform face reconstruction pre-training on large-scale facial images without annotations to learn features of facial geometry and expression regions, followed by two-stage fine-tuning on FER datasets with limited labels. In addition, to further alleviate the scarcity of labeled and diverse images, we propose a Mixup-based data augmentation strategy tailored for facial images, in which the loss weights of real and virtual images are determined according to the intersection-over-union (IoU) of the faces in the two images. Experiments on RAF-DB, AffectNet, and FERPlus show that our method outperforms existing semi-supervised FER methods and achieves new state-of-the-art performance. Remarkably, with only 5% and 25% of the training sets, our method achieves 64.02% on AffectNet and 88.23% on RAF-DB, respectively, which is comparable to fully supervised state-of-the-art methods. Code will be made publicly available at https://github.com/zhelishisongjie/SSFER.
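
As a rough illustration of the two ingredients the abstract names for the augmentation strategy, the sketch below shows standard Mixup (a convex combination of two samples and their one-hot labels) and the IoU of two face bounding boxes. It is a minimal sketch, not the authors' implementation: the function names `box_iou` and `mixup`, the `(x1, y1, x2, y2)` box format, and the use of flat Python lists for images are all assumptions made for illustration. The abstract does not state how the IoU is mapped to loss weights, so the sketch stops at computing the IoU.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    The box format is an assumption; the abstract only says that the IoU
    of the faces in the two images is used.
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mixup(x1, y1, x2, y2, lam):
    """Standard Mixup: convex combination of two samples and their labels.

    x1, x2 are flattened images (lists of floats); y1, y2 are one-hot labels.
    lam in [0, 1] is the mixing coefficient.
    """
    x = [lam * p + (1.0 - lam) * q for p, q in zip(x1, x2)]
    y = [lam * p + (1.0 - lam) * q for p, q in zip(y1, y2)]
    return x, y


# Two hypothetical face boxes that partially overlap.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
# A virtual sample built from two toy "images" with different labels.
virtual_x, virtual_y = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1], lam=0.25)
```

Intuitively, a low IoU means the two faces are poorly aligned, for example because of different head angles, so the virtual sample is less face-like; how the method turns that signal into loss weights is defined in the paper body rather than in the abstract.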

Paper Structure

This paper contains 25 sections, 11 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: The evolution of face recognition datasets and facial expression recognition datasets
  • Figure 2: The pipeline of our SSFER, which consists of three stages. (a) Self-supervised pre-training stage: unlabeled images are divided into patches and 75% of the patches are randomly masked; the remaining 25% of visible patches are fed into the ViT encoder, and the output embeddings together with the masked patches are fed into the ViT decoder for image reconstruction. (b) Supervised fine-tuning stage: convex combinations of images and labels are divided into patches and fed into the ViT encoder and the MLP head, and the FaceMix loss is then calculated from the predictions and the IoUs of the images. (c) Semi-supervised fine-tuning stage: unlabeled images are fed into the student model after strong augmentation and into the teacher model after weak augmentation, with the teacher parameters updated by an exponential moving average of the student model. If the prediction confidence of the teacher model is higher than the threshold, the class with the highest confidence is used as the pseudo-label to calculate the cross-entropy loss with the student predictions. Notably, the ViT encoder in our SSFER framework is the vanilla ViT-Base without modifications.
  • Figure 3: Samples at different masking ratios. Red circles mark failures to reconstruct expression regions; green circles mark successful reconstructions.
  • Figure 4: Use Mixup to mix two facial images to construct a virtual sample.
  • Figure 5: Example of mixing images with different head angles.
  • ...and 4 more figures
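
The mechanics described in the Figure 2 caption for stages (a) and (c) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, model parameters are represented as flat lists of floats, and the patch count (196, as for a 224x224 image with 16x16 patches), the EMA decay (0.999), and the confidence threshold (0.95) are illustrative values that the caption does not specify. Only the 75% masking ratio comes from the caption.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, rng=None):
    """Stage (a): randomly split patch indices into visible and masked sets."""
    rng = rng or random.Random()
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # reconstructed by the decoder
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked


def ema_update(teacher, student, decay=0.999):
    """Stage (c): teacher parameters follow an exponential moving average
    of the student parameters. The decay value is an assumption."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def pseudo_label(teacher_probs, threshold=0.95):
    """Stage (c): keep the teacher's top class only if its confidence
    exceeds the threshold; otherwise the sample gets no pseudo-label."""
    confidence = max(teacher_probs)
    if confidence > threshold:
        return teacher_probs.index(confidence)
    return None


# 196 patches is an assumed count, not a value stated in the caption.
visible, masked = random_masking(196, mask_ratio=0.75, rng=random.Random(0))
teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 0.0], decay=0.9)
label = pseudo_label([0.01, 0.97, 0.02])        # confident: class index 1
no_label = pseudo_label([0.40, 0.35, 0.25])     # not confident: None
```

In a full pipeline, the pseudo-label returned for a weakly augmented view would serve as the cross-entropy target for the student's prediction on the strongly augmented view of the same image, and samples returning `None` would be skipped in that loss.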