MIDAS: Mixing Ambiguous Data with Soft Labels for Dynamic Facial Expression Recognition

Ryosuke Kawamura, Hideaki Hayashi, Noriko Takemura, Hajime Nagahara

TL;DR

MIDAS tackles ambiguity in dynamic facial expression recognition (DFER) by augmenting training data with mixtures of soft-labeled video frames. By convexly combining frames from distinct clips together with their soft emotion distributions, MIDAS extends mixup to video data whose true hard labels are unknown; the method is formalized through a vicinal risk framework with a beta-distributed mixing ratio. Empirical results on the DFEW dataset show that MIDAS surpasses state-of-the-art methods in both weighted average recall (WAR) and unweighted average recall (UAR), with improved performance on underrepresented emotions and cross-dataset generalization to AFEW. The findings suggest that mixing-based data augmentation with soft labels is a robust strategy for real-world facial expression recognition (FER), where annotator disagreement and the temporal co-occurrence of emotions are common.
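
The two metrics named above can be illustrated with a short sketch. It assumes the usual definitions: WAR is overall accuracy (per-class recall weighted by class frequency) and UAR is the plain mean of per-class recall. The function name `war_uar` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length sequences of class labels.

    Assumed definitions (not quoted from the paper):
      WAR = overall accuracy, i.e. recall weighted by class frequency
      UAR = unweighted mean of per-class recall
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy example with an imbalanced class: "fear" is rare.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]
war, uar = war_uar(y_true, y_pred)
```

In this toy case the single missed "fear" sample costs little in WAR but halves the recall of the rare class, which is why UAR is the more sensitive metric for underrepresented emotions.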

Abstract

Dynamic facial expression recognition (DFER) is an important task in the field of computer vision. To apply automatic DFER in practice, it is necessary to accurately recognize ambiguous facial expressions, which often appear in data in the wild. In this paper, we propose MIDAS, a data augmentation method for DFER, which augments ambiguous facial expression data with soft labels consisting of probabilities for multiple emotion classes. In MIDAS, the training data are augmented by convexly combining pairs of video frames and their corresponding emotion class labels, which can also be regarded as an extension of mixup to soft-labeled video data. This simple extension is remarkably effective in DFER with ambiguous facial expression data. To evaluate MIDAS, we conducted experiments on the DFEW dataset. The results demonstrate that the model trained on the data augmented by MIDAS outperforms the existing state-of-the-art method trained on the original dataset.
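
The convex combination described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes both clips have the same shape, draws the mixing coefficient from a symmetric Beta(alpha, alpha) as in standard mixup, and uses a placeholder `alpha` value. The function name `mix_soft_labeled_clips` and all parameters are hypothetical.

```python
import numpy as np


def mix_soft_labeled_clips(clip_a, label_a, clip_b, label_b, alpha=0.8, rng=None):
    """Convexly combine two video clips and their soft labels.

    clip_*  : arrays of identical shape, e.g. (frames, height, width, channels)
    label_* : soft labels (class probabilities) of identical length
    alpha   : Beta(alpha, alpha) parameter; 0.8 is a placeholder, not a value
              reported in the paper
    """
    rng = np.random.default_rng() if rng is None else rng
    clip_a = np.asarray(clip_a, dtype=np.float32)
    clip_b = np.asarray(clip_b, dtype=np.float32)
    label_a = np.asarray(label_a, dtype=np.float32)
    label_b = np.asarray(label_b, dtype=np.float32)
    if clip_a.shape != clip_b.shape or label_a.shape != label_b.shape:
        raise ValueError("clips and labels must have matching shapes")

    # Mixing coefficient lambda in [0, 1], drawn from a beta distribution.
    lam = float(rng.beta(alpha, alpha))

    # The same lambda is applied to every frame and to the soft labels.
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_clip, mixed_label, lam


# Usage with synthetic data: 4 frames of 8x8 RGB, 3 emotion classes.
rng = np.random.default_rng(0)
clip_a = np.zeros((4, 8, 8, 3))
clip_b = np.ones((4, 8, 8, 3))
mixed_clip, mixed_label, lam = mix_soft_labeled_clips(
    clip_a, [0.7, 0.3, 0.0], clip_b, [0.0, 0.2, 0.8], rng=rng
)
```

Because both inputs are probability vectors and the weights sum to one, the mixed label is again a valid probability distribution, so it can be used directly as the target of a cross-entropy loss.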

Paper Structure

This paper contains 21 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Example of an ambiguous facial expression. The images were taken from the DFEW dataset (Jiang et al., 2020). The bar chart in the bottom row shows the soft-labeled annotation constructed based on the proportions of votes by ten annotators. The annotations are split into four emotion classes.
  • Figure 2: Outline of the data mixing procedure in MIDAS. In MIDAS, the training data are augmented by convexly combining pairs of video frames and their corresponding emotion class labels. The mixing coefficient $\lambda$ is randomly generated from a beta distribution. The key point is that soft labels representing class probabilities are used instead of hard labels.
  • Figure 3: Examples of a clear facial expression (left) and an ambiguous facial expression (right) with their soft label annotations in the DFEW dataset (Jiang et al., 2020).
  • Figure 4: Emotion class distribution of the DFEW dataset
  • Figure 5: Ratio of coexisting emotions for each emotion class. The values in the figure were calculated by averaging the soft label values of the samples that belong to the corresponding emotion class. The higher the value, the more likely the emotion class is to be voted for by the annotators simultaneously, that is, the more likely it is to coexist.
  • ...and 1 more figure
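
The captions of Figures 1 and 5 describe two small computations: building a soft label from annotator votes, and averaging soft labels per emotion class. The sketch below illustrates both under stated assumptions: the class names and their order are illustrative, and a sample is assumed to "belong" to the class that received the most votes. The function names are hypothetical.

```python
from collections import Counter

# Illustrative class list; the order is an assumption made for this sketch.
EMOTIONS = ("happy", "sad", "neutral", "angry", "surprise", "disgust", "fear")


def soft_label_from_votes(votes, classes=EMOTIONS):
    """Turn a list of annotator votes into vote proportions per class."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(classes)
    if unknown:
        raise ValueError(f"unknown classes in votes: {sorted(unknown)}")
    return [counts[c] / len(votes) for c in classes]


def coexistence_ratios(soft_labels, classes=EMOTIONS):
    """Average the soft labels of the samples assigned to each class.

    Assumption: a sample belongs to the class with the largest soft-label value.
    Classes with no samples are omitted from the result.
    """
    groups = {c: [] for c in classes}
    for label in soft_labels:
        top = classes[max(range(len(classes)), key=label.__getitem__)]
        groups[top].append(label)
    return {
        c: [sum(col) / len(rows) for col in zip(*rows)]
        for c, rows in groups.items()
        if rows
    }


# Ten annotators, votes split across three classes (an ambiguous sample).
ambiguous = soft_label_from_votes(["happy"] * 6 + ["surprise"] * 3 + ["neutral"])
clear = soft_label_from_votes(["happy"] * 10)
ratios = coexistence_ratios([ambiguous, clear])
```

Here `ratios["happy"]` gives the average vote share that each emotion receives among samples labeled "happy"; larger off-class values indicate emotions that annotators more often report together with it.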