Rethinking the Learning Paradigm for Facial Expression Recognition
Weijie Wang, Bo Li, Nicu Sebe, Bruno Lepri
TL;DR
This work tackles ambiguous crowdsourced annotations in facial expression recognition (FER) by reframing FER as a Partial Label Learning (PLL) task, in which each image carries a set of candidate labels rather than a single one-hot label. It introduces a fully transformer-based architecture that combines Masked Image Modeling (MIM) pretraining with a decoder-driven label disambiguation mechanism. The decoder is guided by learnable label embeddings, and a uniform embedding regularization is used to handle class imbalance. Empirically, the method achieves state-of-the-art results on RAF-DB, FERPlus, and AffectNet (7- and 8-class settings) by exploiting annotation uncertainty and robust feature representations. The approach demonstrates the value of weak supervision and self-supervised pretraining for real-world FER and points toward more robust, bias-aware recognition under ambiguous labeling.
Abstract
Because crowdsourced annotation is subjective and facial expressions are inherently similar across classes, real-world Facial Expression Recognition (FER) datasets usually exhibit ambiguous annotations. To simplify the learning paradigm, most previous methods convert the ambiguous annotations into precise one-hot labels and train FER models in an end-to-end supervised manner. In this paper, we rethink the existing training paradigm and propose that it is better to train FER models with weakly supervised strategies on the original ambiguous annotations.
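To make the contrast between the two training paradigms concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical per-image vote vector over seven emotion classes (the vote counts, class order, and function names are illustrative only) and uses a generic partial-label loss, the negative log of the probability mass assigned to the candidate set, in place of the paper's decoder-based disambiguation.

```python
import math

# Hypothetical class order, for illustration only.
EMOTIONS = ["neutral", "happy", "surprise", "sad", "angry", "disgust", "fear"]


def one_hot_from_votes(votes):
    """Conventional paradigm: keep only the top-voted class (first one on ties)."""
    top = max(range(len(votes)), key=lambda i: votes[i])
    return [1 if i == top else 0 for i in range(len(votes))]


def candidate_set_from_votes(votes, min_votes=1):
    """Partial-label paradigm: keep every class that received at least min_votes."""
    return [1 if v >= min_votes else 0 for v in votes]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def partial_label_loss(logits, candidate_mask):
    """Generic PLL loss (an assumption, not the paper's loss):
    negative log of the total probability on candidate labels.
    With a one-hot mask this reduces to standard cross-entropy."""
    probs = softmax(logits)
    mass = sum(p for p, c in zip(probs, candidate_mask) if c)
    return -math.log(mass)


# Illustrative annotator votes for one image: mostly "happy", some "surprise".
votes = [0, 6, 3, 0, 0, 0, 1]
logits = [0.1, 1.2, 1.0, -0.5, -0.3, -0.8, 0.0]  # made-up model outputs

hard = one_hot_from_votes(votes)
soft = candidate_set_from_votes(votes)
print("one-hot label :", hard)
print("candidate set :", soft)
print("CE loss       :", round(partial_label_loss(logits, hard), 4))
print("PLL loss      :", round(partial_label_loss(logits, soft), 4))
```

The one-hot conversion discards the "surprise" and "fear" votes entirely, while the candidate set keeps them and leaves it to training to decide which candidate is the true label. Because the candidate set always contains the top-voted class, this particular loss is never larger than the cross-entropy on the one-hot label for the same logits.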
