AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder

Qiaoqiao Jin, Rui Shi, Yishun Dou, Bingbing Ni

TL;DR

AU-vMAE addresses the scarcity of labeled video data for facial action unit detection by leveraging large-scale unlabeled videos through a videoMAE pre-training pipeline. It then applies a fine-tuning network that performs video-level, frame-level, and patch-level FAU detection guided by intra-frame co-occurrence and inter-frame transition priors via finite state machines. The approach achieves state-of-the-art results on BP4D and DISFA, with notable F1-score gains and robust performance across input granularities, demonstrating the value of multi-level supervision and knowledge-guided temporal modeling. This work reduces dependence on labeled data and advances practical FAU detectors by effectively modeling spatiotemporal AU dynamics.

Abstract

Current Facial Action Unit (FAU) detection methods generally encounter difficulties due to the scarcity of labeled video training data and the limited number of training face IDs, which leaves the trained feature extractor with insufficient coverage to model the large diversity of inter-person facial structures and movements. To explicitly address these challenges, we propose a novel video-level pre-training scheme that fully exploits the multi-label property of FAUs in video as well as temporal label consistency. At the heart of our design is a pre-trained video feature extractor based on the video masked autoencoder, together with a fine-tuning network that jointly completes the multi-level video FAU analysis tasks, i.e., integrating both video-level and frame-level FAU detection, thus dramatically expanding the supervision set from sparse FAU annotations to all video frames, including masked ones. Moreover, we use inter-frame and intra-frame AU pair state matrices as prior knowledge to guide network training instead of traditional Graph Neural Networks, for better temporal supervision. Our approach demonstrates substantial performance gains over existing state-of-the-art methods on the BP4D and DISFA FAU datasets.

Paper Structure

This paper contains 24 sections, 9 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: AU labels exhibit spatial and temporal correlations: an expression activates several FAUs together, which induces spatial relations, and the changing musculature links temporally distinct transitions.
  • Figure 2: The AU-vMAE framework comprises two primary modules: (a) A pre-trained model is developed by reconstructing facial videos with masks to extract video features. (b) This pre-trained encoder is then applied to three downstream subtasks: 1) video-level FAU detection which processes all frames of a video to predict AU labels for each frame; 2) frame-level FAU detection using equidistantly sampled frames to predict AU labels for the entire video, and 3) patch-level FAU detection where randomly masked frames are used to predict AUs frame-by-frame.
  • Figure 3: Co-occurrence matrix of the DISFA dataset. The co-occurrence matrix displays the likelihood of two labels appearing together, with each element corresponding to a pair of AU labels.
  • Figure 4: Finite state machine (FSM) of two AU labels. At any moment $t$, a pair of AU labels ($i$ and $j$) can be in one of four possible states (00, 01, 10, and 11). The table presents the 16 potential state transitions from time $t$ to time $t+1$. Their corresponding probabilities are visualized in the graph.
  • Figure 5: The original and augmented AU label distributions of the BP4D and DISFA datasets.
  • ...and 1 more figure
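
The two priors described in the captions of Figures 3 and 4 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(T, K)` binary label layout, and the choice of normalisation (conditional probability for co-occurrence, row-normalised counts for transitions) are all assumptions made here for illustration.

```python
import numpy as np


def cooccurrence_matrix(labels: np.ndarray) -> np.ndarray:
    """Estimate intra-frame AU co-occurrence from binary labels.

    labels: (T, K) array, labels[t, k] = 1 if AU k is active in frame t.
    Returns a (K, K) matrix whose entry (i, j) estimates
    P(AU j active | AU i active). The conditional normalisation is an
    assumption; the source only says "likelihood of appearing together".
    """
    labels = labels.astype(np.float64)
    joint = labels.T @ labels            # frames in which both AUs are active
    active = np.diag(joint).copy()       # frames in which AU i is active
    denom = np.where(active > 0, active, 1.0)  # avoid division by zero
    return joint / denom[:, None]


def pair_transition_matrix(labels: np.ndarray, i: int, j: int) -> np.ndarray:
    """Estimate inter-frame state transitions for the AU pair (i, j).

    The joint state of the pair is encoded as 2 * AU_i + AU_j, so the
    states 00, 01, 10, 11 map to indices 0, 1, 2, 3. Entry (a, b) of the
    returned 4x4 matrix estimates P(state b at t+1 | state a at t).
    Rows for states that never occur are left as zeros.
    """
    states = 2 * labels[:, i].astype(int) + labels[:, j].astype(int)
    counts = np.zeros((4, 4), dtype=np.float64)
    # Accumulate one count per consecutive frame pair (t, t+1).
    np.add.at(counts, (states[:-1], states[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Hypothetical toy sequence: 5 frames, 2 AUs.
toy_labels = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [0, 0]])
cooc = cooccurrence_matrix(toy_labels)
trans = pair_transition_matrix(toy_labels, 0, 1)
```

On real data the matrices would be estimated from the training annotations of BP4D or DISFA; how the resulting priors enter the training loss is not specified in this section and is left out of the sketch.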