SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition

Sixian Ding, Xu Jiang, Zhongjing Du, Jiaqi Cui, Xinyi Zeng, Yan Wang

TL;DR

SIT-FER addresses the unreliability of semantic-level pseudo-labels in semi-supervised facial expression recognition by integrating semantic-, instance-, and text-level information. It introduces an Instance Memory Buffer and a multimodal supervision scheme, derives the final pseudo-label as $\widehat{p}=\frac{p^{sem}+p^{text}+p^{ins}}{3}$, and optimizes a combined loss $L_{total}=L_s+\lambda_1 L_t+\lambda_2 L_u$ over labeled and unlabeled data. Results on RAF-DB, SFEW, and AffectNet show state-of-the-art performance, with significant gains in low-label regimes and, in some settings, accuracy above fully supervised baselines; ablations confirm the contributions of MSSL and of Multi-level Information Pseudo-label Generation (MIPG). The work advances SS-DFER by generating more robust pseudo-labels from multi-level signals and multimodal supervision, which reduces labeling demands and improves applicability to real-world FER tasks.
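
The two formulas in the summary above can be illustrated with a minimal NumPy sketch. The function names, the three-class toy vectors, and the default values of `lambda_1` and `lambda_2` are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def fuse_pseudo_label(p_sem, p_text, p_ins):
    """Average the three probability vectors: p_hat = (p_sem + p_text + p_ins) / 3."""
    p_sem, p_text, p_ins = (np.asarray(p, dtype=float) for p in (p_sem, p_text, p_ins))
    return (p_sem + p_text + p_ins) / 3.0

def total_loss(l_s, l_t, l_u, lambda_1=1.0, lambda_2=1.0):
    """L_total = L_s + lambda_1 * L_t + lambda_2 * L_u.

    The default weights are placeholders, not values reported by the authors.
    """
    return l_s + lambda_1 * l_t + lambda_2 * l_u

# Toy example with three classes (hypothetical numbers).
# When the three levels agree, the fused label stays peaked on one class.
agree = fuse_pseudo_label([0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05])
# When they disagree, the fused label flattens towards uniform.
disagree = fuse_pseudo_label([0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8])

print(agree, disagree)
```

Because each input is a probability vector, the average is again a probability vector, so no renormalization is needed after fusion.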

Abstract

Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty of obtaining sufficient labeled data in practical settings. However, existing SS-DFER methods mainly use generated semantic-level pseudo-labels for supervised learning, and the unreliability of these pseudo-labels compromises performance and undermines practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic-, instance-, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the text-level and instance-level probabilities, respectively. These two probabilities are then aggregated with the semantic-level probability to obtain the final pseudo-labels. Furthermore, to make better use of the one-hot labels of the labeled data, we also incorporate text embeddings extracted from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.
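
The abstract describes turning similarities between a facial vision feature and text or instance features into probabilities, without giving the exact formulas. The sketch below shows one plausible way to do this, assuming cosine similarity followed by a temperature-scaled softmax, and assuming that instance-level similarities are aggregated per class through the stored labels. The function names, the `temperature` parameter, and the aggregation rule are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def text_level_prob(vision_feat, text_embeds, temperature=0.1):
    """Assumed form: softmax over cosine similarity to one text embedding per class.

    vision_feat: (D,), text_embeds: (C, D) -> probabilities of shape (C,).
    """
    sims = l2_normalize(np.asarray(text_embeds, dtype=float)) @ l2_normalize(
        np.asarray(vision_feat, dtype=float))
    return softmax(sims / temperature)

def instance_level_prob(vision_feat, buffer_feats, buffer_labels, num_classes,
                        temperature=0.1):
    """Assumed form: similarity-weighted vote of stored instances, grouped by label.

    buffer_feats: (N, D), buffer_labels: (N,) integer class ids -> (C,).
    """
    sims = l2_normalize(np.asarray(buffer_feats, dtype=float)) @ l2_normalize(
        np.asarray(vision_feat, dtype=float))
    weights = softmax(sims / temperature)                          # (N,)
    one_hot = np.eye(num_classes)[np.asarray(buffer_labels, dtype=int)]  # (N, C)
    return weights @ one_hot                                       # (C,)

# Toy inputs (hypothetical): 3 classes, 4-dimensional features.
text_embeds = np.eye(3, 4)
buffer_feats = np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0],
                         [0.9, 0.1, 0.0, 0.0]])
buffer_labels = [0, 1, 0]
p_text = text_level_prob([0.0, 0.0, 1.0, 0.0], text_embeds)
p_ins = instance_level_prob([1.0, 0.0, 0.0, 0.0], buffer_feats, buffer_labels, 3)
```

Both functions return vectors that sum to one, so their outputs can be averaged directly with a semantic-level softmax output.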

Paper Structure

This paper contains 17 sections, 11 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: A sketch of the method we propose for generating matching targets. Previous works (a) consider similarity only at the semantic level. Since semantic-level information is not always reliable, our method (b) calculates matching targets by jointly combining instance-, text-, and semantic-level information.
  • Figure 2: Illustration of our SIT-FER model. $WA$ represents Weak Augmentation and $SA$ represents Strong Augmentation. For labeled data, we use the one-hot labels to constrain the network through $L_s$, and we calculate the similarity between vision features and text features to co-supervise model training through $L_t$. For unlabeled data, we integrate the semantic, text, and instance probabilities into the final pseudo-label $\widehat{p}$ and update the model with the unsupervised consistency loss $L_u$. Additionally, unlabeled data with highly reliable predictions are added to the Instance Memory Buffer, which is initialized with labeled data. Buffer $\mathcal{B}$ stores the feature embeddings of images and buffer $\mathcal{L}$ stores their corresponding labels.
  • Figure 3: Diagram of our Multi-level Information Pseudo-label Generation method processing clear and ambiguous samples. If the semantic-, instance-, and text-level predictions are similar, the resulting pseudo-label is much sharper, with high confidence for some classes. When the three predictions differ, the resulting pseudo-label is much smoother.
  • Figure 4: Evaluation of parameters $\alpha$ and $\gamma$ on RAF-DB.
  • Figure 5: Visualization of the benefit of the three-level probability within our SIT-FER. The boxes at the top of each image display the predicted labels along with their respective probabilities, while the ground-truth labels are presented at the bottom. Using the three-level probability for prediction improves accuracy.
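
The Instance Memory Buffer described in the Figure 2 caption can be sketched as follows: it is initialized with labeled features and grows with unlabeled samples whose predictions are highly reliable. The class name, the `threshold` value of 0.95, the use of the maximum pseudo-label probability as the reliability measure, and the absence of any capacity limit are assumptions made for illustration; the caption does not specify them.

```python
import numpy as np

class InstanceMemoryBuffer:
    """Hypothetical sketch: B holds feature embeddings, L holds their labels."""

    def __init__(self, labeled_feats, labeled_labels):
        # Initialize the buffer from labeled data, as stated in the caption.
        self.B = np.asarray(labeled_feats, dtype=float)   # shape (N, D)
        self.L = np.asarray(labeled_labels, dtype=int)    # shape (N,)

    def update(self, feats, pseudo_labels, threshold=0.95):
        """Add unlabeled samples whose top pseudo-label probability >= threshold.

        The threshold is an assumed reliability criterion, not the paper's value.
        Returns the number of samples added.
        """
        feats = np.asarray(feats, dtype=float)
        pseudo_labels = np.asarray(pseudo_labels, dtype=float)
        keep = pseudo_labels.max(axis=1) >= threshold
        self.B = np.concatenate([self.B, feats[keep]], axis=0)
        self.L = np.concatenate([self.L, pseudo_labels[keep].argmax(axis=1)])
        return int(keep.sum())

# Toy usage (hypothetical numbers): two labeled samples, three unlabeled ones.
buf = InstanceMemoryBuffer(np.zeros((2, 4)), [0, 1])
added = buf.update(
    np.ones((3, 4)),
    [[0.97, 0.02, 0.01],   # confident, kept with label 0
     [0.50, 0.30, 0.20],   # ambiguous, skipped
     [0.01, 0.01, 0.98]],  # confident, kept with label 2
)
```

A real training loop would likely bound the buffer size or refresh stale embeddings; this sketch omits both because the source does not describe them.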