MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for Facial Expression Recognition

Fan Zhang, Xiaobao Guo, Xiaojiang Peng, Alex Kot

TL;DR

MIMIC addresses the high cost and domain gap of pre-training for facial expression recognition by combining self-supervised masked image modeling on a mid-scale general dataset (ImageNet-1K) with a novel mix-supervised contrastive fine-tuning strategy. The two-stage approach enables the vanilla Vision Transformer to learn robust cross-domain representations without auxiliary modules, achieving competitive or state-of-the-art results on RAF-DB, FERPlus, and AffectNet, especially when scaling to ViT-L/16. Key contributions include demonstrating that ImageNet-1K-based MIM can outperform supervised pre-training on face datasets for FER, and introducing mix-based positive pairs to better capture inter-class similarity in FER. The work has practical impact by reducing data-label costs and providing a scalable, generalizable FER training paradigm suitable for large-scale or diverse deployment scenarios.

Abstract

Cutting-edge research in facial expression recognition (FER) currently favors convolutional neural network (CNN) backbones that are pre-trained with supervision on face recognition datasets for feature extraction. However, due to the vast scale of face recognition datasets and the high cost of collecting facial labels, this pre-training paradigm incurs significant expense. To this end, we propose to pre-train vision Transformers (ViTs) through a self-supervised approach on a mid-scale general image dataset. However, the divergence between general datasets and FER datasets is more pronounced than the domain disparity between face datasets and FER datasets. Therefore, we propose a contrastive fine-tuning approach to effectively mitigate this domain disparity. Specifically, we introduce a novel FER training paradigm named Mask Image pre-training with MIx Contrastive fine-tuning (MIMIC). In the initial phase, we pre-train the ViT via masked image reconstruction on general images. Subsequently, in the fine-tuning stage, we introduce a mix-supervised contrastive learning process, which provides the model with a wider range of positive samples through a mixing strategy. Through extensive experiments on three benchmark datasets, we demonstrate that MIMIC outperforms the previous training paradigm, showing its capability to learn better representations. Remarkably, the results indicate that the vanilla ViT can achieve impressive performance without intricate, specially designed auxiliary modules. Moreover, when scaling up the model size, MIMIC exhibits no performance saturation and is superior to the current state-of-the-art methods.
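To make the first phase concrete, the following is a minimal sketch of masked image reconstruction bookkeeping: choosing which patches to hide and scoring the reconstruction only on the hidden ones. It is an illustration, not the authors' implementation. The function names, the 75% mask ratio, and the mean-squared-error objective are assumptions borrowed from common masked-autoencoder practice; the abstract does not state them.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets.

    mask_ratio=0.75 is an assumed default, not a value from the source.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_reconstruction_loss(target, prediction, masked):
    """Mean squared error computed only over the masked patches.

    target and prediction are lists of patches; each patch is a list of floats.
    """
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # A 224x224 image cut into 16x16 patches yields 14 * 14 = 196 patches.
    visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
    print(len(visible), len(masked))
```

In a real pipeline the encoder would see only the visible patches and a lightweight decoder would predict the masked ones; the sketch covers only the index split and the loss restriction.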

Paper Structure (17 sections, 10 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: The relationship of three domains. There exists a higher domain disparity between FER and general image classification.
  • Figure 2: Illustration of our pipeline. The encoder is first pre-trained by masked image modeling in other domains (e.g., ImageNet). Then we mix the augmented FER images and send them to the encoder. After the encoder, a projection head and a classification head are used to simultaneously refine visual representations learned from other domains and recognize facial expressions.
  • Figure 3: Comparison of three formats of contrastive learning. The self-supervised format ignores intra-class similarity, while the supervised contrastive format leverages class labels to attract intra-class samples. However, the supervised contrastive format focuses only on intra-class similarity and ignores inter-class similarity. The mix-supervised contrastive format selects more diverse positive samples, since inter-class samples with high similarity may also improve visual representations.
  • Figure 4: An illustration of our mixing strategy. Sa and Su denote Sadness and Surprise, respectively.
  • Figure 5: Comparison with other pre-training methods. All the methods are pre-trained on ImageNet-1K and then fine-tuned on FER datasets. Our method outperforms supervised, MIM (e.g., MAE, LocalMIM), and contrastive learning (e.g., MoCo V3) methods.
  • ...and 2 more figures
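
The captions for Figures 3 and 4 describe mixing FER images and treating samples that share part of a mixed label as positives. The sketch below shows one plausible reading of that idea. It is not the paper's formulation, whose equations are not reproduced on this page. The mixup-style linear blend, the use of soft-label overlap as a positive weight, the temperature value, and all function names are assumptions made for illustration.

```python
import math


def mix_pair(x_a, x_b, y_a, y_b, lam):
    """Blend two flattened images and their one-hot labels with weight lam.

    A mixup-style linear blend is assumed here; the source does not specify
    which mixing operation is used.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def positive_weight(y_i, y_j):
    """Overlap of two soft labels; 0 means the samples share no class."""
    return sum(a * b for a, b in zip(y_i, y_j))


def weighted_contrastive_loss(z, y, temperature=0.1):
    """Soft-label contrastive loss over L2-normalised embeddings z.

    Each pair (i, j) contributes in proportion to positive_weight(y[i], y[j]),
    so a mixed Sadness/Surprise sample is a partial positive for both classes.
    """
    n = len(z)
    losses = []
    for i in range(n):
        # Scaled dot-product similarity to every other sample in the batch.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        weights = {j: positive_weight(y[i], y[j]) for j in sims}
        total_w = sum(weights.values())
        if total_w == 0:
            continue  # anchor has no positives in this batch
        weighted = sum(w * (sims[j] - log_denom) for j, w in weights.items())
        losses.append(-weighted / total_w)
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    sadness, surprise = [1.0, 0.0], [0.0, 1.0]
    _, y_mixed = mix_pair([1.0, 0.0], [0.0, 1.0], sadness, surprise, lam=0.75)
    print(y_mixed, positive_weight(y_mixed, sadness))
```

With hard one-hot labels the weights reduce to 0 or 1 and the loss behaves like an ordinary supervised contrastive loss; the soft labels are what let samples from different classes act as partial positives.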