Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning

Xuri Ge; Junchen Fu; Fuhai Chen; Shan An; Nicu Sebe; Joemon M. Jose

TL;DR

This work tackles the lack of explainability in facial action unit (FAU) recognition by introducing VL-FAU, an end-to-end vision-language framework that jointly predicts AU states and generates interpretable language descriptions. It combines multi-scale visual representations with a two-level language supervision scheme: local language generation for each AU and global language generation for the whole face, with each AU branch refined by a dual-level AU individual refinement (DAIR) module. The model achieves state-of-the-art or competitive results on DISFA and BP4D while explaining its predictions in natural language, and it stays computationally efficient by using a lightweight language module instead of a large language model. These contributions advance practical FAU analysis by pairing high accuracy with human-readable justifications for AU decisions, with potential for broader multimodal explainability in facial analysis.

Abstract

Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation: they focus only on the accuracy of AU recognition and overlook explanations of the corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and a distinguishable global face description when optimising FAU recognition. Through this, the global facial representation and its local AU representations achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is utilised to improve the attention-aware representation capability of each individual AU, based on multi-scale combined facial stem features. Extensive experiments on the DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over state-of-the-art methods on most of the metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretable language descriptions alongside its AU predictions.
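The joint optimisation described in the abstract can be illustrated with a small sketch. It assumes the three tasks are combined as a simple weighted sum of a multi-label AU loss, a per-AU (local) caption loss, and a whole-face (global) caption loss. The function names, the weights `w_local` and `w_global`, and the exact loss forms are hypothetical and are not taken from the paper.

```python
import math

EPS = 1e-7  # numerical floor to keep log() finite


def bce(probs, labels):
    """Mean binary cross-entropy over predicted AU activation probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def token_nll(token_probs):
    """Mean negative log-likelihood of the reference caption tokens."""
    return -sum(math.log(max(p, EPS)) for p in token_probs) / len(token_probs)


def joint_loss(au_probs, au_labels, local_token_probs, global_token_probs,
               w_local=0.5, w_global=0.5):
    """Weighted sum of the AU loss and the two language losses (assumed form).

    local_token_probs: one list of token probabilities per AU branch.
    global_token_probs: token probabilities for the whole-face description.
    """
    l_au = bce(au_probs, au_labels)
    l_local = sum(token_nll(tp) for tp in local_token_probs) / len(local_token_probs)
    l_global = token_nll(global_token_probs)
    return l_au + w_local * l_local + w_global * l_global


# Two AUs, one short local caption per AU, one global caption.
loss = joint_loss([0.9, 0.2], [1, 0], [[0.8, 0.9], [0.7]], [0.6, 0.9])
print(round(loss, 4))
```

With both weights set to zero the objective reduces to plain AU recognition, which corresponds to the baseline without language auxiliary supervision mentioned in the Figure 4 caption.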

Paper Structure (16 sections, 12 equations, 5 figures, 4 tables)

Introduction
Related Work
Approach
Multi-level AU Representation Learning
Global Facial Representation Extraction
Dual-level AU Individual Refinement
Auxiliary Supervision with Local and Global Language Generation
Global Language Generation
Local Language Generation
Vision-Language Joint Learning
Experiments
Dataset and Implementation Details
State-of-the-art Comparisons
Ablation Studies
Visualization of Results
...and 1 more section

Figures (5)

Figure 1: Comparison of FAU recognition paradigms: conventional methods versus our VL-FAU. While mainstream methods directly predict AU activation states (orange stream), VL-FAU not only offers activation predictions but also provides detailed local and global descriptions of the corresponding AUs in natural language.
Figure 2: The overall end-to-end architecture of the proposed VL-FAU for explainable facial AU recognition. Given one face image, the multi-scale combined facial representation is extracted with a pre-trained Swin-Transformer. VL-FAU is a multi-branch network containing multiple independent AU recognition branches as well as a global language generation branch. Each independent AU recognition branch has a dual-level AU individual refinement module (DAIR) for individual AU attention-aware mining and a local AU language generation module for explicit semantic auxiliary supervision to improve inter-AU distinguishability. A global language generation branch based on the multi-scale facial representation is leveraged to preserve shared stem feature diversity via multiple facial state foci. Finally, the multi-branch AU refined representations are stacked for multi-label classification with local and global language auxiliary supervision (best viewed in color).
Figure 3: Multi-Label Performance Balancing Analysis. X-axis and Y-axis denote the variances of multi-label F1 scores on BP4D and DISFA, respectively. Circle size indicates the total relative performance improvement (%) compared with JAA-Net on BP4D and DISFA.
Figure 4: t-SNE visualization of the baseline model (w/o local and global language generation auxiliary) and full VL-FAU model on BP4D.
Figure 5: Visualizations of our proposed explainable facial AU recognition (VL-FAU) with explicit local and global language descriptions on BP4D.
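The balance measure plotted in Figure 3 (the variance of per-AU F1 scores) can be sketched as follows. This is a minimal illustration that assumes binary per-frame labels and uses population variance; the caption does not say whether population or sample variance is used, and the helper names are hypothetical.

```python
from statistics import pvariance


def f1_score(y_true, y_pred):
    """F1 for one AU from binary ground-truth and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_au_f1(labels, preds):
    """labels/preds: one row per frame, one 0/1 column per AU."""
    n_aus = len(labels[0])
    return [
        f1_score([row[k] for row in labels], [row[k] for row in preds])
        for k in range(n_aus)
    ]


def f1_balance(labels, preds):
    """Return (mean F1, variance of F1 across AUs); lower variance = more balanced."""
    scores = per_au_f1(labels, preds)
    return sum(scores) / len(scores), pvariance(scores)


# Toy data: four frames, two AUs (illustrative values only).
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
preds = [[1, 0], [0, 1], [0, 1], [0, 1]]
mean_f1, var_f1 = f1_balance(labels, preds)
print(round(mean_f1, 4), round(var_f1, 4))
```

A model with a high mean F1 but a large variance performs unevenly across AUs, which is the trade-off the figure is designed to expose.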
