QCS: Feature Refining from Quadruplet Cross Similarity for Facial Expression Recognition
Chengpeng Wang, Li Chen, Lili Wang, Zhaofan Li, Xuebin Lv
TL;DR
This work tackles facial expression recognition by disentangling discriminative expression features from unlabeled interference. It introduces Cross Similarity Attention (CSA), a global spatial attention mechanism designed to mine fine-grained similarities across image pairs, and integrates it within Quadruplet Cross Similarity (QCS), a four-branch centrally symmetric network that enables stable joint training. A Contrastive Residual Distillation scheme transfers knowledge from the cross modules back to a single inference-time base branch, preserving efficiency. Empirical results on RAF-DB, FERPlus, and AffectNet show state-of-the-art performance without relying on landmark information, with ablations validating the importance of CSA, residual connections, and intra-/inter-class refinement. The approach offers a principled way to refine FER features and could inform broader applications of cross-image contrastive refinement in vision tasks.
Abstract
Facial expression recognition (FER) is challenging because the expression-relevant features that dataset labels describe are mixed with unlabeled, redundant ones. In this paper, we introduce Cross Similarity Attention (CSA) to mine richer intrinsic information from image pairs, overcoming a limitation that arises when the Scaled Dot-Product Attention of ViT is applied directly to compute the similarity between two different images. Building on CSA, we simultaneously minimize intra-class differences and maximize inter-class differences at the fine-grained feature level through interactions among multiple branches. Contrastive residual distillation transfers the information learned in the cross module back to the base network. We design a four-branch, centrally symmetric network, named Quadruplet Cross Similarity (QCS), which alleviates the gradient conflicts arising from the cross module and achieves balanced, stable training. QCS adaptively extracts discriminative features while isolating redundant ones. The cross-attention modules are used only during training; a single base branch is retained at inference, so inference time does not increase. Extensive experiments show that the proposed method achieves state-of-the-art performance on several FER datasets.
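To make the idea of mining similarity across an image pair more concrete, the following is a minimal sketch and not the authors' CSA formulation, which the abstract does not specify. It assumes each image is already encoded as a list of spatial token vectors, scores every position of image A by its mean cosine similarity to all positions of image B, and turns those scores into a spatial attention map with a softmax. The function names, the mean pooling, and the use of cosine similarity are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_similarity_attention(tokens_a, tokens_b):
    """Hypothetical cross-image spatial attention (an assumption, not the paper's CSA).

    tokens_a, tokens_b: lists of equal-length feature vectors, one per
    spatial position of image A and image B respectively.
    Returns (weights, refined_a): a spatial attention map over A's positions
    and A's tokens reweighted by that map.
    """
    unit_a = [l2_normalize(t) for t in tokens_a]
    unit_b = [l2_normalize(t) for t in tokens_b]

    # Cosine similarity between every position in A and every position in B.
    sim = [[sum(x * y for x, y in zip(ta, tb)) for tb in unit_b] for ta in unit_a]

    # Pool over B's positions (mean pooling is an illustrative choice).
    scores = [sum(row) / len(row) for row in sim]

    # Normalize over A's positions to obtain a global spatial attention map.
    weights = softmax(scores)

    # Reweight A's original tokens by their attention weight.
    refined_a = [[w * x for x in tok] for w, tok in zip(weights, tokens_a)]
    return weights, refined_a


# Toy example: A's first token matches B's content, its second does not.
tokens_a = [[1.0, 0.0], [0.0, 1.0]]
tokens_b = [[1.0, 0.0], [1.0, 0.0]]
weights, refined_a = cross_similarity_attention(tokens_a, tokens_b)
print(weights)
```

In this toy example the first position of image A receives the larger weight because it is the one that resembles image B. For a same-class pair such a map would emphasize shared regions; how QCS combines the maps for intra-class and inter-class pairs across its four branches is not covered by this sketch.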
