Learning Contrastive Feature Representations for Facial Action Unit Detection

Ziqiao Shang; Bin Liu; Fengmao Lv; Fei Teng; Tianrui Li; Lan-Zhe Guo

TL;DR

This work tackles facial action unit (AU) detection under two core challenges: severe per-AU class imbalance and noisy or false AU annotations. It introduces AUNCE, a discriminative contrastive learning loss that blends self-supervised and supervised signals to emphasize differential AU information rather than full-face pixel cues. AUNCE incorporates a negative sample re-weighting scheme to prioritize minority AUs and a three-type positive sample sampling strategy to mitigate label noise, including self-supervised signals and class centroids. Experiments on BP4D, DISFA, BP4D+, GFT, and Aff-Wild2 show state-of-the-art performance and strong cross-dataset generalization, with ablations validating each component’s contribution. The approach offers a robust, efficient direction for AU detection in both constrained and in-the-wild settings, and the code is publicly available for replication.

Abstract

For the Facial Action Unit (AU) detection task, accurately capturing the subtle facial differences between distinct AUs is essential for reliable detection. Additionally, AU detection faces challenges from class imbalance and the presence of noisy or false labels, which undermine detection accuracy. In this paper, we introduce a novel contrastive learning framework designed for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of parameter updates for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experiments on five widely used benchmark datasets (BP4D, DISFA, BP4D+, GFT and Aff-Wild2) show that our approach outperforms state-of-the-art AU detection methods. Our code is available at https://github.com/Ziqiao-Shang/AUNCE.
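As a rough illustration of the negative sample re-weighting idea, the sketch below scales each negative term in a generic InfoNCE-style contrastive loss. The function name `weighted_info_nce`, the per-negative weights, and the temperature value are assumptions made for this example; the actual AUNCE formulation is the one given in the paper and the linked repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_info_nce(anchor, positive, negatives, neg_weights, temperature=0.5):
    """InfoNCE-style loss with a weight on every negative term.

    neg_weights is a hypothetical per-negative weight list: a larger
    weight makes that negative contribute more to the loss, and so to
    the gradient. With all weights equal to 1 this reduces to the
    standard InfoNCE loss.
    """
    if len(negatives) != len(neg_weights):
        raise ValueError("one weight is required per negative sample")
    pos_term = math.exp(cosine(anchor, positive) / temperature)
    neg_term = sum(
        w * math.exp(cosine(anchor, n) / temperature)
        for n, w in zip(negatives, neg_weights)
    )
    return -math.log(pos_term / (pos_term + neg_term))

# Toy 2-D features: one anchor, one positive, two negatives.
anchor = [1.0, 0.0]
positive = [1.0, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

uniform_loss = weighted_info_nce(anchor, positive, negatives, [1.0, 1.0])
reweighted_loss = weighted_info_nce(anchor, positive, negatives, [2.0, 1.0])
```

In this toy setup, raising the weight of the first negative increases the loss, which is one way a re-weighting scheme can change how strongly particular samples drive the parameter updates.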

Paper Structure (31 sections, 10 equations, 6 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Discriminative contrastive learning
Contrastive learning for AU detection
Method
Objectives
Discriminative contrastive learning framework
Negative sample re-weighting strategy
Positive sample sampling strategy
Supervised Signal
Self-supervised Signal
Encoder for AU detection
Evaluation
Experiment Setting
Datasets
...and 16 more sections

Figures (6)

Figure 1: Illustration of discriminative contrastive learning frameworks for representation learning. The goal is to maximize the similarity between positive samples (semantically similar) and minimize the similarity between negative samples (semantically dissimilar).
Figure 2: Overview of the training pipeline of our discriminative contrastive learning framework. The training process consists of two stages: pretraining and linear evaluation. The pretraining stage precedes the linear evaluation stage. In the first stage, the feature encoder is pretrained by the AUNCE loss. In the second stage, we adopt a linear evaluation protocol, consistent with the practice established in contrastive learning frameworks such as SimCLR. Specifically, a single linear fully connected layer is trained atop the frozen encoder to assess the quality of the learned feature representations.
Figure 3: Illustration of our proposed positive sample sampling strategy. It demonstrates the sampling strategy used to mitigate the impact of noisy labels by integrating self-supervised and supervised signals. Positive samples are categorized into three types and sampled with varying probabilities, enhancing the model's robustness to variations and errors in the labeled dataset.
Figure 4: Visualization of feature maps generated by models with and without the negative sample re-weighting strategy on the BP4D dataset, illustrating how the model with the negative sample re-weighting strategy concentrates more on the related facial regions. (a): Feature maps generated by model E (with the negative sample re-weighting strategy), and (b): Feature maps generated by model C (without the negative sample re-weighting strategy).
Figure 5: t-SNE visualization of feature representations on the BP4D and DISFA datasets. Colors indicate whether AU1 is present. Top row: (a) Representations optimized by WCE on the BP4D dataset, (b)-(f) Visualizations of the ablation study on the BP4D dataset, corresponding to models A-E, respectively. Bottom row: (g) Representations optimized by AUNCE on the DISFA dataset, (h)-(l) Visualizations of the ablation study on the DISFA dataset, corresponding to models A-E, respectively.
...and 1 more figures
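
The positive sample sampling strategy described in the Figure 3 caption (three positive types drawn with varying probabilities) might be sketched as follows. The type names in `POSITIVE_TYPES` and the probabilities (0.5, 0.3, 0.2) are hypothetical placeholders chosen for this example; they are not values taken from the paper.

```python
import random

# Hypothetical names for the three positive types; illustration only.
POSITIVE_TYPES = ("augmented_view", "same_label_sample", "class_centroid")

def sample_positive_type(probs, rng):
    """Pick which kind of positive to pair with the anchor.

    probs holds one probability per entry of POSITIVE_TYPES and must
    sum to 1; rng is a random.Random instance so runs can be seeded.
    """
    if len(probs) != len(POSITIVE_TYPES):
        raise ValueError("one probability is required per positive type")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return rng.choices(POSITIVE_TYPES, weights=probs, k=1)[0]

# Draw 1000 positive types with assumed probabilities and tally them.
rng = random.Random(0)
counts = {name: 0 for name in POSITIVE_TYPES}
for _ in range(1000):
    counts[sample_positive_type((0.5, 0.3, 0.2), rng)] += 1
print(counts)
```

Mixing label-independent positives (such as an augmented view of the anchor itself) with label-dependent ones is what lets a self-supervised signal temper the effect of wrong labels.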
