Contrastive Learning of Person-independent Representations for Facial Action Unit Detection

Yong Li, Shiguang Shan

TL;DR

This work tackles data scarcity in facial action unit (AU) detection by proposing CLP, a self-supervised framework that learns frame-level AU representations from unlabeled videos. It combines temporal contrastive learning within short video clips (TCL), which enforces AU discriminativeness, with cross-identity reconstruction (CIR), which enforces identity invariance, using a memory-bank-based dictionary and a momentum encoder. Empirical results on BP4D, DISFA, and GFT show that CLP yields discriminative AU representations, outperforming other contrastive learning methods and significantly narrowing the gap to supervised AU detection, with additional evidence of generalization to facial expression recognition (FER) datasets. The approach highlights the practicality of leveraging large-scale unlabeled video data to curb annotation costs in AU analysis and suggests future enhancements with transformer-based modeling of AU interactions.

Abstract

Facial action unit (AU) detection, which aims to classify the AUs present in a facial image, has long suffered from insufficient AU annotations. In this paper, we aim to mitigate this data scarcity issue by learning AU representations from a large number of unlabelled facial videos in a contrastive learning paradigm. We formulate the self-supervised AU representation learning signals as twofold: (1) the AU representation should be frame-wise discriminative within a short video clip; (2) facial frames that are sampled from different identities but show analogous facial AUs should have consistent AU representations. To achieve these goals, we propose to contrastively learn the AU representation within a video clip and devise a cross-identity reconstruction mechanism to learn person-independent representations. Specifically, we adopt a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip that consists of consecutive input facial frames. Moreover, the cross-identity reconstruction mechanism pushes faces that come from different identities but show analogous AUs close together in the latent embedding space. Experimental results on three public AU datasets demonstrate that the learned AU representation is discriminative for AU detection. Our method outperforms other contrastive learning methods and significantly closes the performance gap between the self-supervised and supervised AU detection approaches.
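
To make the margin-based objective mentioned in the abstract concrete, the following is a minimal sketch of a hinge-style triplet loss. The abstract does not state the exact loss, so the Euclidean distance, the hinge form, the margin value of 0.5, and all names below are assumptions chosen for illustration, not the authors' formulation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss (assumed form).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings, made up for illustration only.
anchor = [0.0, 0.0]
near_frame = [0.1, 0.0]  # temporally close frame, treated as the positive
far_frame = [1.0, 0.0]   # temporally distant frame, treated as the negative

loss_ok = margin_triplet_loss(anchor, near_frame, far_frame)
loss_violated = margin_triplet_loss(anchor, far_frame, near_frame)
```

In this toy setting the correctly ordered triplet incurs no loss, while swapping the positive and negative produces a positive penalty, which is the behaviour a margin-based contrastive objective is meant to have.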

Paper Structure

This paper contains 18 sections, 6 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Main idea of the proposed method, contrastive learning of person-independent representations for AU detection (CLP). CLP learns frame-wise discriminative AU representations via temporal contrastive learning. To further remove person-specific nuisances, CLP exploits a cross-identity reconstruction mechanism that pushes faces from different identities showing consistent AUs close together in the latent embedding space, thus encoding inter-identity consistency in the representations.
  • Figure 2: The framework of CLP. To make the representations distinctive within a short video clip, we randomly sample a temporally consecutive sequence and construct a set of triplets according to the temporal interval for intra-video contrastive learning (Sec. \ref{sec:temporal_contrastive_learning}). Besides, we exploit the cross-identity reconstruction (CIR) mechanism to make the representations consistent for faces of different subjects that show analogous AUs. In CIR, we reconstruct the soft nearest neighbour $\hat{q}_{r}$ of $q_{r}$ from a dictionary, which will be described in Sec. \ref{sec:Cross_video_Cycle_Consistency}.
  • Figure 3: An example facial sequence with nine frames sampled from the VoxCeleb2 dataset \cite{chung2018voxceleb2}. We manually construct a set of triplets: $(f_{1}^a, f_{2}^p, f_{3}^n), (f_{1}^a, f_{3}^p, f_{4}^n), \cdots, (f_{1}^a, f_{8}^p, f_{9}^n)$. This intra-video temporal contrastive learning paradigm naturally learns to classify the frames based on their temporal distance from an anchor frame. Another set of triplets can be constructed in the reversed temporal order.
  • Figure 4: Illustration of the relationship between $\mathbf{C}$ and $\overline{\mathbf{C}}$.
  • Figure 5: Feature visualization on the BP4D dataset. Top row: colors indicate whether AU12 is present. Bottom row: colors indicate the subjects. The CLP-learned representations are visibly more invariant w.r.t. subjects.
  • ...and 4 more figures
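
The triplet construction in the Figure 3 caption and the soft nearest neighbour in the Figure 2 caption can be sketched as follows. This is only an illustrative sketch: `build_triplets` and `soft_nearest_neighbour` are hypothetical names, the reversed order is assumed to take the last frame as the anchor, and the dot-product similarity, softmax weighting, and temperature of 0.1 are assumptions, since the captions do not define how $\hat{q}_{r}$ is computed.

```python
import math


def build_triplets(num_frames, reverse=False):
    """Return (anchor, positive, negative) frame indices, 1-based.

    Forward order fixes the first frame as the anchor and pairs each
    later frame (positive) with its successor (negative), as in the
    Figure 3 caption. With reverse=True the frame order is flipped, so
    the last frame becomes the anchor (an assumption).
    """
    order = list(range(1, num_frames + 1))
    if reverse:
        order.reverse()
    anchor = order[0]
    return [(anchor, order[i], order[i + 1]) for i in range(1, num_frames - 1)]


def soft_nearest_neighbour(query, dictionary, temperature=0.1):
    """Softmax-weighted average of dictionary entries (assumed form).

    Weights come from dot-product similarity to the query, scaled by an
    assumed temperature.
    """
    sims = [sum(q * k for q, k in zip(query, key)) / temperature
            for key in dictionary]
    peak = max(sims)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    return [sum(w * key[d] for w, key in zip(weights, dictionary)) / total
            for d in range(len(query))]


# Nine-frame clip as in Figure 3: seven triplets in each direction.
forward = build_triplets(9)
backward = build_triplets(9, reverse=True)

# Toy dictionary of two embeddings, made up for illustration only.
reconstruction = soft_nearest_neighbour([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With a low temperature the reconstruction is dominated by the dictionary entry most similar to the query. A higher temperature would blend entries more evenly.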