MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for Facial Expression Recognition
Fan Zhang, Xiaobao Guo, Xiaojiang Peng, Alex Kot
TL;DR
MIMIC addresses the high cost and domain gap of pre-training for facial expression recognition (FER) by combining self-supervised masked image modeling (MIM) on a mid-scale general dataset (ImageNet-1K) with a novel mix-supervised contrastive fine-tuning strategy. This two-stage approach enables the vanilla Vision Transformer (ViT) to learn robust cross-domain representations without auxiliary modules, achieving competitive or state-of-the-art results on RAF-DB, FERPlus, and AffectNet, especially when scaling to ViT-L/16. Key contributions include demonstrating that ImageNet-1K-based MIM can outperform supervised pre-training on face datasets for FER, and introducing mix-based positive pairs to better capture inter-class similarity in FER. In practice, the approach reduces data-labeling costs and provides a scalable, generalizable FER training paradigm suitable for large-scale or diverse deployment scenarios.
Abstract
Cutting-edge research in facial expression recognition (FER) currently favors convolutional neural network (CNN) backbones that are pre-trained with supervision on face recognition datasets for feature extraction. However, because face recognition datasets are very large and facial labels are costly to collect, this pre-training paradigm is expensive. To address this, we propose to pre-train vision Transformers (ViTs) through a self-supervised approach on a mid-scale general image dataset. This choice, however, widens the domain gap: the divergence between general datasets and FER datasets is more pronounced than that between face datasets and FER datasets. We therefore propose a contrastive fine-tuning approach to mitigate this domain disparity. Specifically, we introduce a novel FER training paradigm named Mask Image pre-training with MIx Contrastive fine-tuning (MIMIC). In the first stage, we pre-train the ViT via masked image reconstruction on general images. In the fine-tuning stage, we introduce a mix-supervised contrastive learning process, which uses a mixing strategy to provide the model with a wider range of positive samples. Extensive experiments on three benchmark datasets demonstrate that MIMIC outperforms the previous training paradigm, showing its capability to learn better representations. Remarkably, the results indicate that the vanilla ViT can achieve impressive performance without intricate, specially designed auxiliary modules. Moreover, when the model size is scaled up, MIMIC shows no performance saturation and surpasses current state-of-the-art methods.
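
To make the two stages concrete, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. The abstract does not specify the masking ratio, the mixing operator, or the exact form of the contrastive loss, so everything here is an assumption: the helper names (`random_patch_mask`, `mix`, `mix_supcon_loss`), the mixup-style convex combination, the weighting of positives by the mixing coefficient `lam`, and the temperature `tau` are all hypothetical choices made for illustration.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Stage 1 (assumed form): choose which patches are hidden from the encoder.

    Returns a list of booleans, True where the patch is masked and must be
    reconstructed. The masking ratio is a free parameter, not taken from the paper.
    """
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def mix(x_a, x_b, lam):
    """Assumed mixing strategy: mixup-style convex combination of two inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mix_supcon_loss(anchor, label_a, label_b, lam, bank, bank_labels, tau=0.1):
    """Stage 2 (assumed form): supervised contrastive loss for one mixed anchor.

    Embeddings in `bank` labeled `label_a` count as positives with weight `lam`;
    those labeled `label_b` count as positives with weight `1 - lam`. All other
    embeddings act as negatives through the softmax denominator.
    """
    logits = [cosine(anchor, z) / tau for z in bank]
    # Numerically stable log-sum-exp over all bank entries.
    peak = max(logits)
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    loss = 0.0
    for cls, weight in ((label_a, lam), (label_b, 1.0 - lam)):
        positives = [l for l, y in zip(logits, bank_labels) if y == cls]
        if positives:
            # Average negative log-probability of the positives of this class.
            loss -= weight * sum(l - log_denom for l in positives) / len(positives)
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)
    print("masked patches:", sum(mask), "of", len(mask))

    # Toy 2-D embeddings for three expression classes.
    bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    bank_labels = [0, 1, 2]
    anchor = mix(bank[0], bank[1], lam=0.7)
    print("loss:", mix_supcon_loss(anchor, 0, 1, 0.7, bank, bank_labels))
```

Under these assumptions, a mixed sample is pulled toward both of its source classes in proportion to their share of the mix, which is one plausible way a mixing strategy could widen the set of positive samples. A practical implementation would operate on batched ViT embeddings in a tensor library rather than on Python lists.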
