QCS: Feature Refining from Quadruplet Cross Similarity for Facial Expression Recognition

Chengpeng Wang, Li Chen, Lili Wang, Zhaofan Li, Xuebin Lv

TL;DR

This work tackles facial expression recognition by disentangling discriminative expression features from unlabeled interference. It introduces Cross Similarity Attention (CSA), a global spatial attention mechanism designed to mine fine-grained similarities across image pairs, and integrates it within Quadruplet Cross Similarity (QCS), a four-branch centrally symmetric network that enables stable joint training. A Contrastive Residual Distillation scheme transfers knowledge from the cross modules back to a single inference-time base branch, preserving efficiency. Empirical results on RAF-DB, FERPlus, and AffectNet show state-of-the-art performance without relying on landmark information, with ablations validating the importance of CSA, residual connections, and intra-/inter-class refinement. The approach offers a principled way to refine FER features and could inform broader applications of cross-image contrastive refinement in vision tasks.

Abstract

Facial expression recognition is challenging because the labeled, significant features in datasets are mixed with unlabeled, redundant ones. In this paper, we introduce Cross Similarity Attention (CSA) to mine richer intrinsic information from image pairs, overcoming a limitation that arises when the Scaled Dot-Product Attention of ViT is applied directly to calculate the similarity between two different images. Based on CSA, we simultaneously minimize intra-class differences and maximize inter-class differences at the fine-grained feature level through interactions among multiple branches. Contrastive residual distillation is used to transfer the information learned in the cross module back to the base network. We design a four-branch centrally symmetric network, named Quadruplet Cross Similarity (QCS), which alleviates gradient conflicts arising from the cross module and achieves balanced and stable training. It can adaptively extract discriminative features while isolating redundant ones. The cross-attention modules exist only during training, and a single base branch is retained during inference, so inference time does not increase. Extensive experiments show that our proposed method achieves state-of-the-art performance on several FER datasets.

Paper Structure

This paper contains 32 sections, 10 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: We often assume labeled features exhibit significant and dominant distributions within the dataset; however, unlabeled redundant features may undermine this assumption. Feature similarity can be leveraged to distinguish target features from redundant ones, based on whether they originate from the same label group or different label groups.
  • Figure 2: The framework of the Quadruplet Cross Similarity Network. Joint training is performed on 4 classifiers ${{Cls}_{base}}$ based on a weight-shared backbone and 4 classifiers ${{Cls}_{cross}}$ enhanced by the cross module, with only the red branch retained during inference. Anchor and pos are in the same category, as are neg and neg2. The matrix ${SD}$ performs attention refinement on matrix ${S}$ and matrix ${D}$ by rows or columns.
  • Figure 3: Different types of attention on feature vectors h${\times}$w${\times}$c, where h=w=2, and c denotes channels. Blue and yellow represent features from different images, and each element corresponds to a 1${\times}$1${\times}$c vector. Double-arrowed lines denote the interaction values between the two corresponding elements, and thicker lines indicate higher weights. The 4${\times}$4 green box in (d) represents the interaction matrix of (c).
  • Figure 4: Interaction matrix of 7${\times}$7${\times}$c (h=7, w=7, channel=c) image features in the spatial dimension (h${\times}$w=49).
  • Figure 5: The framework of the Dual Cross Similarity Network.
  • ...and 8 more figures
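
The spatial interaction matrix described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It flattens two h×w×c feature maps into h·w channel vectors each and compares every position of one image with every position of the other, giving an (h·w)×(h·w) matrix (4×4 for h=w=2, 49×49 for h=w=7). Cosine similarity is used here purely as an assumption for illustration; this section does not state the similarity measure used by CSA. The function names and the random features are hypothetical stand-ins for backbone outputs.

```python
import math
import random


def flatten_spatial(feature_map):
    """Flatten an h x w x c nested list into a list of h*w channel vectors."""
    return [vec for row in feature_map for vec in row]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def interaction_matrix(feat_a, feat_b):
    """Compare every spatial position of feat_a with every position of feat_b.

    Assumption: cosine similarity is the interaction value; the measure
    actually used by CSA is not given in this section.
    """
    a = flatten_spatial(feat_a)
    b = flatten_spatial(feat_b)
    return [[cosine(u, v) for v in b] for u in a]


def random_features(h, w, c, rng):
    """Hypothetical stand-in for backbone features of shape h x w x c."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(w)]
            for _ in range(h)]


rng = random.Random(0)
feat_a = random_features(7, 7, 8, rng)  # image A, h=w=7, c=8
feat_b = random_features(7, 7, 8, rng)  # image B, same shape
matrix = interaction_matrix(feat_a, feat_b)
print(len(matrix), len(matrix[0]))  # 49 49
```

Row i of `matrix` holds the similarities between position i of the first image and all positions of the second, so refinement "by rows or columns" as mentioned in the Figure 2 caption would operate along one of these two axes.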