Hierarchical Vision-Language Interaction for Facial Action Unit Detection

Yong Li; Yi Ren; Yizhe Zhang; Wenhua Zhang; Tianyi Zhang; Muyun Jiang; Guo-Sen Xie; Cuntai Guan


TL;DR

This work proposes Hierarchical Vision-language Interaction for AU Understanding (HiVA), a method that uses textual AU descriptions as semantic priors to guide and enhance AU detection. By combining multi-grained vision-based AU features with refined language-based AU details, HiVA achieves robust and semantically enriched AU detection.

Abstract

Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. In addition, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
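The two attention mechanisms named in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, and every function name below (`cross_attend`, `dual_cross_attention`, `disentangled`, `contextual`) is hypothetical; it shows only the routing difference between a one-to-one (DDCA-style) and an all-AU (CDCA-style) interaction, not the authors' implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    # Single-head scaled dot-product attention; keys double as values
    # (assumption: no learned projection matrices).
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(dim)])
    return attended

def dual_cross_attention(visual, textual):
    # Bidirectional: visual tokens query text, and text tokens query visual.
    return cross_attend(visual, textual), cross_attend(textual, visual)

def disentangled(visual_per_au, text_per_au):
    # DDCA-style routing (assumed): AU i's visual tokens interact only with
    # the tokens of AU i's own description.
    return [dual_cross_attention(v, t)
            for v, t in zip(visual_per_au, text_per_au)]

def contextual(global_visual, text_per_au):
    # CDCA-style routing (assumed): global visual tokens interact with the
    # pooled description tokens of all AUs.
    all_text = [token for description in text_per_au for token in description]
    return dual_cross_attention(global_visual, all_text)

# Toy inputs: 2 AUs, 2-dimensional features.
visual_per_au = [[[1.0, 0.0]], [[0.0, 1.0]]]
text_per_au = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]]
local_out = disentangled(visual_per_au, text_per_au)
global_v2t, global_t2v = contextual([[1.0, 1.0]], text_per_au)
```

In the local path each AU sees only its own description, so the attended features stay AU-specific; in the global path one set of visual tokens is mixed with every description, which is where inter-AU context can enter.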


Paper Structure (15 sections, 7 equations, 7 figures, 6 tables, 1 algorithm)


Introduction
Related Work
Method
Framework Overview
Image and Textual AU Feature Encoding
AU-Aware Dynamic Graph Construction
Hierarchical Vision-Language Interaction
DDCA
CDCA
Training Strategy
Experiments
Comparison with the state-of-the-art methods
Analysis of Each Component in HiVA
Conclusion
Acknowledgment

Figures (7)

Figure 1: Overview of the proposed Hierarchical Vision-Language Interaction for Facial AU detection. (a) illustrates the process of leveraging a large language model (e.g., GPT-4) to generate diverse and semantically rich AU descriptions, addressing the limitations of scarce textual knowledge. (b) depicts the core framework, which integrates visual features from input images and textual features from AU descriptions through hierarchical vision-language interaction, utilizing both local and global attention mechanisms for robust AU detection.
Figure 2: Framework of the proposed Hierarchical Vision-Language Interaction for AU Understanding (HiVA). HiVA consists of four main components: (a) AU-Oriented Visual Feature Encoding, which extracts AU-specific visual representations using a CNN and transformer backbone; (b) Language-Based AU Description Modeling, which encodes textual AU descriptions to capture both intra- and inter-AU semantics; (c) AU-Aware Dynamic Graph Construction, which builds AU-specific graphs based on visual similarity to model adaptive inter-AU relations; and (d) Hierarchical Vision-Language Interaction, which employs Disentangled Dual Cross-Attention (DDCA) and Contextual Dual Cross-Attention (CDCA) so that visual and textual features interact at both local and global levels.
Figure 3: Illustration of the Hierarchical Vision-Language Interaction module in HiVA. This component integrates two complementary attention mechanisms: Disentangled Dual Cross-Attention (DDCA) for fine-grained one-to-one interaction between AU-specific visual features and their corresponding textual descriptions, and Contextual Dual Cross-Attention (CDCA) for global interaction between global visual features and all AU descriptions. This bidirectional, cross-modal interaction enables the model to capture both localized semantic grounding and global inter-AU dependencies for robust AU detection.
Figure 4: Per-AU F1 score improvements of the proposed HiVA over its vision-only baseline. HiVA consistently enhances AU detection performance, particularly for infrequently activated AUs. This trend highlights the effectiveness of language-based AU descriptions in improving feature robustness and modeling inter-AU dependencies via dual cross-modal attention mechanisms.
Figure 5: Visualization of the cosine similarity matrices among textual AU embeddings at different training epochs for HiVA (w/o $\mathcal{L}_{\text{diff}}$) (top) and HiVA (bottom). HiVA progressively learns more discriminative and diverse AU embeddings, while the variant without $\mathcal{L}_{\text{diff}}$ shows persistently high similarity across different AUs, indicating limited semantic differentiation.
...and 2 more figures
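
Two of the captions above rely on pairwise similarity: Figure 2(c) builds AU graphs from visual similarity, and Figure 5 inspects cosine similarity among textual AU embeddings. The sketch below shows one way such quantities could be computed. The top-k neighbour rule and the mean off-diagonal diagnostic are assumptions made for illustration; this section does not specify the paper's graph construction rule or the form of $\mathcal{L}_{\text{diff}}$, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity of two non-zero vectors (zero vectors would divide by 0).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    # Pairwise cosine similarities, one row and column per AU.
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def mean_off_diagonal(sim):
    # Hypothetical diagnostic: average similarity between different AUs.
    # Lower values mean the embeddings are more distinct from one another.
    n = len(sim)
    total = sum(sim[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def top_k_adjacency(sim, k):
    # Assumed graph rule: connect each AU to its k most similar other AUs.
    n = len(sim)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            adjacency[i][j] = 1
    return adjacency

# Toy embeddings for three AUs.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = similarity_matrix(embeddings)
spread = mean_off_diagonal(sim)
graph = top_k_adjacency(sim, k=1)
```

Because the adjacency is recomputed from the current features, the graph changes with the input, which is the sense in which a similarity-based graph is dynamic.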
