Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition

Zihan Wang, Siyang Song, Cheng Luo, Songhe Deng, Weicheng Xie, Linlin Shen

TL;DR

The paper tackles the challenge of recognizing facial Action Units (AUs) by explicitly modeling their hierarchical spatial relationships and multi-scale temporal dynamics. It introduces MDHR, comprising two modules: Multi-scale Facial Dynamic Modelling (MFD), which captures AU-related motion across multiple spatial scales with adaptive weighting, and Hierarchical Spatio-temporal AU Relationship Modelling (HSR), which learns local region-based AU dependencies and cross-regional AU interactions via a Graph Attention Network, followed by a Temporal Convolution Network for sequence prediction. The approach achieves state-of-the-art results on BP4D and DISFA, demonstrating that incorporating both multi-scale dynamics and hierarchical AU relationships yields significant performance gains over static and other spatio-temporal methods. This work advances AU recognition by providing a unified framework that respects the anatomical and temporal structure of facial movements, with practical implications for affective computing and related applications.

Abstract

Human facial action units (AUs) are mutually related in a hierarchical manner: not only are they associated with each other in both the spatial and temporal domains, but AUs located in the same or nearby facial regions also show stronger relationships than those in different facial regions. While no existing approach thoroughly models such hierarchical inter-dependencies among AUs, this paper proposes to comprehensively model multi-scale AU-related dynamics and the hierarchical spatio-temporal relationships among AUs for recognizing their occurrence. Specifically, we first propose a novel multi-scale temporal differencing network with an adaptive weighting block to explicitly capture facial dynamics across frames at different spatial scales, which accounts for the heterogeneity in range and magnitude of different AUs' activations. Then, a two-stage strategy is introduced to hierarchically model the relationships among AUs based on their spatial distribution (i.e., local and cross-region AU relationship modelling). Experimental results on BP4D and DISFA show that our approach achieves new state-of-the-art performance in AU occurrence recognition. Our code is publicly available at https://github.com/CVI-SZU/MDHR.
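
To make the idea of multi-scale temporal differencing concrete, the following is a minimal sketch in plain Python, not the authors' implementation (which is available in the repository above). It assumes each spatial scale is represented by a list of per-frame feature vectors, and it uses a fixed softmax over motion magnitude as a stand-in for the learned adaptive weighting block. The function names, the scale names, and the toy inputs are all hypothetical.

```python
import math

def temporal_difference(frames, k=1):
    # frames: list of T feature vectors (flat lists of floats) from one scale.
    # Returns T - k difference vectors, frames[t] - frames[t - k].
    return [
        [cur - prev for cur, prev in zip(frames[t], frames[t - k])]
        for t in range(k, len(frames))
    ]

def softmax_weights(diff):
    # Assumption: a fixed softmax over absolute motion magnitude stands in
    # for the learned adaptive weighting block described in the abstract.
    mags = [abs(d) for d in diff]
    peak = max(mags)
    exps = [math.exp(m - peak) for m in mags]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_dynamics(frames, k=1):
    # Re-weight each difference vector position by position.
    out = []
    for diff in temporal_difference(frames, k):
        weights = softmax_weights(diff)
        out.append([w * d for w, d in zip(weights, diff)])
    return out

def multi_scale_dynamics(scales, k=1):
    # scales: dict mapping a (hypothetical) scale name to its per-frame features.
    return {name: weighted_dynamics(frames, k) for name, frames in scales.items()}

# Toy input: a fine scale with small changes and a coarse scale with large ones.
scales = {
    "shallow": [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0], [0.1, 0.2, 0.0, 0.0]],
    "deep": [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]],
}
dynamics = multi_scale_dynamics(scales, k=1)
```

Computing differences separately per scale lets small, localized changes and large, widespread changes each show up at the resolution where they are easiest to see; the weighting step then decides how much each position contributes.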

Paper Structure

This paper contains 12 sections, 12 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) hierarchical AU relationship; and (b) heterogeneous range and magnitude of different AUs' activation.
  • Figure 2: The pipeline of our MDHR, where $k$ is set to 1. The MFD module (Sec. \ref{subsec:AU-dynamic}) first computes facial dynamics at multiple spatial scales based on feature maps output from multiple hidden layers and the output layer of the backbone. The HSR module (Sec. \ref{subsec:spatio-temporal}) then separately models the relationships among AUs located in the same and in different facial regions. The Auxiliary branch is used only during training to make the AU combination for each facial region (the upper facial region is used as an example in the figure). Finally, a TCN is applied individually to each AU feature's sequence over all $T$ input frames.
  • Figure 3: Visualization of adaptive weight matrices learned by the MFD module. The weight matrices learned for feature maps of shallow layers (layers 1 and 2) emphasize subtle motions (e.g., subtle eyebrow and cheek motions), while large cheek and mouth movements are captured in deeper layers (layers 3 and 4).
  • Figure 4: Visualization of AU predictions under three HSR settings, where white solid and hollow dots denote activated and inactivated AUs. The green dotted circles denote the local AU relationship modelling, while the yellow lines/weights denote the graph edges describing the association between AUs. It can be observed that local relationship modelling effectively captures dependencies between AUs in the same region to make better predictions (e.g., AU2 and AU26 in column 3), while additionally using cross-regional AU relationship modelling further exploits the learned relationship cues to improve AU predictions in different facial regions (e.g., AU6 and AU9 in column 4).
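
The two-stage relationship modelling described in the captions of Figures 2 and 4 can be illustrated with a small plain-Python sketch. This is not the paper's HSR module: it replaces the learned Graph Attention Network with parameter-free scaled dot-product attention, and the `REGIONS` grouping, function names, and feature values are hypothetical choices made only for illustration.

```python
import math

# Illustrative grouping only (an assumption); the paper's actual
# AU-to-region assignment is not given in this summary.
REGIONS = {
    "upper": ["AU1", "AU2", "AU4"],
    "lower": ["AU12", "AU15", "AU17"],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # Scaled dot-product attention for a single query over a set of nodes.
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def local_stage(features):
    # Stage 1: each AU attends only to the AUs in its own facial region.
    out = {}
    for aus in REGIONS.values():
        group = [features[au] for au in aus]
        for au in aus:
            out[au] = attend(features[au], group, group)
    return out

def cross_region_stage(features):
    # Stage 2: each AU attends to the AUs of the other regions; the message
    # is added residually so the locally refined feature is preserved.
    out = {}
    for region, aus in REGIONS.items():
        others = [
            features[au]
            for other, members in REGIONS.items()
            if other != region
            for au in members
        ]
        for au in aus:
            message = attend(features[au], others, others)
            out[au] = [x + m for x, m in zip(features[au], message)]
    return out

# Toy per-AU features for a single frame.
features = {
    "AU1": [1.0, 0.0], "AU2": [0.9, 0.1], "AU4": [0.0, 1.0],
    "AU12": [0.5, 0.5], "AU15": [0.2, 0.8], "AU17": [0.1, 0.9],
}
local = local_stage(features)
refined = cross_region_stage(local)
```

Running the local stage first means the cross-region stage exchanges information between features that already reflect their within-region context, which mirrors the ordering described for the HSR module. In the full pipeline, each AU's refined features would then be collected over the $T$ frames and processed by a temporal network.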