AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Kaishen Yuan; Zitong Yu; Xin Liu; Weicheng Xie; Huanjing Yue; Jingyu Yang

TL;DR

AUFormer tackles AU detection under data-scarce conditions by freezing a Vision Transformer and introducing parameter-efficient Mixture-of-Knowledge Experts (MoKE) dedicated to each AU. MoKEs extract multi-scale and correlation knowledge via Multi-Receptive Field (MRF) and Context-Aware (CA) modules and collaborate within AU-specific groups to adapt ViT without full finetuning. A novel Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss) guides learning by emphasizing activated AUs and differentiating the difficulty of unactivated AUs while discarding potentially mislabeled samples. Across macro- and micro-expression benchmarks (BP4D, DISFA, CASME II), AUFormer achieves state-of-the-art or competitive results with superior data efficiency and generalization, illustrating the practicality of PETL for AU detection.

Abstract

Facial Action Units (AUs) are a central concept in affective computing, and AU detection has long been an active research topic. Existing methods either overfit because they train a large number of learnable parameters on scarce AU-annotated datasets, or rely heavily on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, but existing PETL methods are not designed around AU characteristics. We therefore apply the PETL paradigm to AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE, specific to a certain AU and carrying minimal learnable parameters, first integrates personalized multi-scale and correlation knowledge. The MoKE then collaborates with the other MoKEs in its expert group to obtain aggregated information, which is injected into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which encourages the model to focus more on activated AUs, differentiates the difficulty of unactivated AUs, and discards potentially mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and the micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code for AUFormer is available at https://github.com/yuankaishen2001/AUFormer.
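
The asymmetric treatment of activated and unactivated AUs described above can be illustrated with a small sketch. The exact MDWA-Loss formula is not given in this summary, so the code below follows the general asymmetric-loss recipe (a probability margin plus a focusing exponent on negatives) rather than the authors' definition. The function name `asymmetric_weighted_loss`, the per-AU `weights` and `gammas`, and the `margin` value are all illustrative assumptions.

```python
import math


def asymmetric_weighted_loss(probs, labels, weights, gammas, margin=0.1, eps=1e-8):
    """Illustrative asymmetric loss over the AUs of one sample.

    NOT the paper's exact MDWA-Loss; all parameters are assumptions.
    probs:   predicted activation probability per AU, each in [0, 1]
    labels:  1 if the AU is activated, 0 otherwise
    weights: hypothetical per-AU weights (e.g. larger for rare AUs)
    gammas:  hypothetical per-AU focusing exponents for unactivated AUs
    margin:  hypothetical probability margin subtracted for unactivated AUs
    """
    total = 0.0
    for p, y, w, g in zip(probs, labels, weights, gammas):
        if y == 1:
            # Activated AU: plain weighted log-loss, never down-weighted.
            total += -w * math.log(max(p, eps))
        else:
            # Unactivated AU: shift the probability down by the margin and
            # clip at zero, so predictions at or below the margin cost nothing.
            p_m = max(p - margin, 0.0)
            # The exponent g shrinks the loss of negatives further; a larger g
            # means that AU's negatives are treated as easier.
            total += -w * (p_m ** g) * math.log(max(1.0 - p_m, eps))
    return total / len(probs)


# Two AUs: the first activated (p = 0.9), the second unactivated (p = 0.05).
loss = asymmetric_weighted_loss([0.9, 0.05], [1, 0], [1.0, 1.0], [2.0, 2.0])
```

In this sketch the unactivated AU contributes nothing because its probability falls below the margin, so the loss comes entirely from the activated AU. How the paper's margin truncation singles out mislabeled samples may differ from this simple clipping.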

Paper Structure (12 sections, 9 equations, 9 figures, 5 tables)

Introduction
Related Work
Methodology
Preliminary
Collaboration Mechanism
Structure of MoKE
Loss Function
Experimental results
Settings
Comparison with State-of-the-Art Methods
Ablation Study
Conclusion and Future Work

Figures (9)

Figure 1: Comparison of AUFormer with full fine-tuning, the PETL paradigm, and state-of-the-art methods on BP4D in terms of learnable parameters, FLOPs, and F1-score. The size of each bubble is proportional to the number of learnable parameters.
Figure 2: The overall architecture of the proposed AUFormer. MoKEs first extract essential multi-scale and correlation knowledge through MRF and CA operators. Then, the personalized features learned for each AU are integrated through a parameter-efficient intra-group collaboration mechanism for AU detection. We provide a detailed illustration of the $l$-th Transformer block.
Figure 3: Comparison between WCE-Loss, WA-Loss, and MWA-Loss.
Figure 4: Comparison between AUs of different difficulty degrees.
Figure 6: Comparison of data efficiency capabilities between AUFormer, ME-GraphAU, and KS. $^\dagger$ These numbers are derived from our replication based on open-source code.
...and 4 more figures
