Rethinking the Learning Paradigm for Facial Expression Recognition

Weijie Wang; Bo Li; Nicu Sebe; Bruno Lepri

TL;DR

This work tackles ambiguous crowdsourced annotations in facial expression recognition (FER) by reframing FER as a Partial Label Learning (PLL) task. It introduces a fully transformer-based architecture that combines Masked Image Modeling (MIM) pretraining with a decoder-driven label disambiguation mechanism, guided by learnable label embeddings and a uniform embedding regularization to handle class imbalance. Empirically, the method achieves state-of-the-art results on RAF-DB, FERPlus, and AffectNet (7-class and 8-class settings) by leveraging annotation uncertainty and robust feature representations. The approach demonstrates the value of weak supervision and self-supervised pretraining for real-world FER and points toward more robust, bias-aware recognition under ambiguous labeling.

Abstract

Because crowdsourced annotation is subjective and facial expressions have inherent inter-class similarity, real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotations. To simplify the learning paradigm, most previous methods convert the ambiguous annotation results into precise one-hot annotations and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to train FER models with the original ambiguous annotations using weakly supervised strategies.
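To make the contrast in the abstract concrete, the sketch below shows the two ways crowdsourced votes for one image could be turned into a training target: collapsing them to a single one-hot label by majority vote, or keeping every voted class as a candidate label set as in partial label learning. The class list, the function names, and the `min_votes` threshold are illustrative assumptions, not the paper's preprocessing.

```python
from collections import Counter

# Hypothetical class list for illustration; each dataset defines its own.
CLASSES = ["neutral", "happy", "surprise", "sad", "angry", "disgust", "fear"]


def one_hot_by_majority(votes):
    """Collapse annotator votes to one label (ties broken by class order)."""
    counts = Counter(votes)
    winner = max(CLASSES, key=lambda c: counts[c])
    return [1 if c == winner else 0 for c in CLASSES]


def candidate_set(votes, min_votes=1):
    """Keep every class with at least `min_votes` votes as a candidate label."""
    counts = Counter(votes)
    return [1 if counts[c] >= min_votes else 0 for c in CLASSES]


# Six hypothetical annotators disagree on one image.
votes = ["sad", "sad", "neutral", "sad", "neutral", "fear"]
print(one_hot_by_majority(votes))  # [0, 0, 0, 1, 0, 0, 0]
print(candidate_set(votes))        # [1, 0, 0, 1, 0, 0, 1]
```

The one-hot target discards the "neutral" and "fear" votes entirely, while the candidate set keeps them for a later disambiguation step to resolve.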

Paper Structure (29 sections, 5 equations, 11 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Facial Expression Recognition
Partial Label Learning
Masked Image Modeling
Proposed Methods
Problem Formulation
PLL with Pre-trained Feature Representation
Label Disambiguation
Label Embedding Regularization
Revision Confidence
Experiments
Datasets
Baselines and Experiment Setup
Details for pre-training and fine-tuning
...and 14 more sections

Figures (11)

Figure 1: Random image samples from the FERPlus dataset with potentially noisy labels (colored in red) introduced by simple, voting-based label conversion (Barsoum et al., 2016) from crowdsourcing results.
Figure 2: Overview of our framework. First, we use the MIM pre-trained ViT encoder as the backbone for representing the features of a 2D image. Second, we feed the features obtained from the left part into the transformer decoder together with learnable label embeddings. Third, we revise the $(i-1)$-th confidence and update it to obtain the $i$-th confidence. Finally, the loss is computed between the logits and the $i$-th confidence.
Figure 3: We adopt Masked Image Modeling (MIM) for pre-training, which involves predicting the Histogram of Oriented Gradients (HOG) descriptor of the randomly masked input. After obtaining the pre-trained model, we fine-tune it for the FER task within our framework.
Figure 4: Test accuracy during fine-tuning on AffectNet-7.
Figure 5: The confidence correctness ratio of relabeled data on AffectNet-7.
...and 6 more figures
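
The confidence-revision loop described in the Figure 2 caption can be illustrated with a generic partial-label update. This is a minimal sketch under stated assumptions, not the paper's rule: it renormalizes the softmax output over the candidate labels and blends it with the previous confidence using a hypothetical `momentum` parameter, then computes a soft cross-entropy between the logits and the revised confidence. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def revise_confidence(prev_conf, logits, candidates, momentum=0.9):
    """Assumed update: move the previous confidence toward the model's
    prediction restricted to (and renormalized over) the candidate set."""
    probs = softmax(logits)
    masked = [p * c for p, c in zip(probs, candidates)]
    z = sum(masked)
    target = [m / z for m in masked]
    return [momentum * q + (1 - momentum) * t
            for q, t in zip(prev_conf, target)]


def soft_cross_entropy(logits, conf):
    """Cross-entropy between the predicted distribution and soft targets."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(conf, probs) if q > 0)


# Three classes; classes 0 and 2 are candidates for this sample.
candidates = [1, 0, 1]
conf = [0.5, 0.0, 0.5]       # start uniform over the candidate set
logits = [2.0, 1.0, 0.0]     # hypothetical decoder outputs
conf = revise_confidence(conf, logits, candidates)
loss = soft_cross_entropy(logits, conf)
```

Because the update only mixes distributions supported on the candidate set, non-candidate classes keep zero confidence while mass shifts toward the candidate the model currently favors.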
