Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample

Zhiwen Shao; Hancheng Zhu; Yong Zhou; Xiang Xiang; Bing Liu; Rui Yao; Lizhuang Ma

TL;DR

The mechanism of self-attention weight distribution is explored, in which the self-attention weight distribution of each AU is regarded as spatial distribution and is adaptively learned under the constraint of location-predefined attention and the guidance of AU detection.

Abstract

Facial action unit (AU) detection remains a challenging task, due to the subtlety, dynamics, and diversity of AUs. Recently, the prevailing techniques of self-attention and causal inference have been introduced to AU detection. However, most existing methods directly learn self-attention guided by AU detection, or employ common patterns for all AUs during causal intervention. The former often captures irrelevant information in a global range, and the latter ignores the specific causal characteristic of each AU. In this paper, we propose a novel AU detection framework called AC2D by adaptively constraining self-attention weight distribution and causally deconfounding the sample confounder. Specifically, we explore the mechanism of self-attention weight distribution, in which the self-attention weight distribution of each AU is regarded as spatial distribution and is adaptively learned under the constraint of location-predefined attention and the guidance of AU detection. Moreover, we propose a causal intervention module for each AU, in which the bias caused by training samples and the interference from irrelevant AUs are both suppressed. Extensive experiments show that our method achieves competitive performance compared to state-of-the-art AU detection approaches on challenging benchmarks, including BP4D, DISFA, GFT, and BP4D+ in constrained scenarios and Aff-Wild2 in unconstrained scenarios. The code is available at https://github.com/ZhiwenShao/AC2D.
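
The idea of constraining self-attention with location-predefined attention can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the predefined attention is modeled as the maximum of two Gaussians centered at hypothetical AU sub-centers, and the constraint is a mean squared error between a learned attention map and that target. The names `predefined_attention` and `attention_regression_loss`, the grid size, and `sigma` are hypothetical and are not taken from the AC2D code.

```python
import numpy as np

def predefined_attention(size, sub_centers, sigma):
    """Build a size x size target map as the max of Gaussians at (x, y) sub-centers.

    Assumption: Gaussian-shaped predefined attention; the paper's exact
    definition is not given in this excerpt.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    maps = [
        np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        for cx, cy in sub_centers
    ]
    return np.max(maps, axis=0)

def attention_regression_loss(learned, target):
    """Mean squared error between a learned attention map and the target map."""
    return float(np.mean((learned - target) ** 2))

# Hypothetical symmetric sub-centers for one AU on a 16 x 16 attention grid.
target = predefined_attention(size=16, sub_centers=[(5, 10), (10, 10)], sigma=2.0)

# A random map stands in for unconstrained self-attention weights.
rng = np.random.default_rng(0)
unconstrained = rng.random((16, 16))

# A map pulled most of the way toward the target stands in for constrained attention.
constrained = 0.9 * target + 0.1 * unconstrained

loss_far = attention_regression_loss(unconstrained, target)
loss_near = attention_regression_loss(constrained, target)
print(loss_far, loss_near)
```

In a full model such a regression term would be weighted against the AU detection loss, so the attention stays near the predefined regions while still adapting to what helps detection.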

Paper Structure (28 sections, 18 equations, 6 figures, 9 tables)

This paper contains 28 sections, 18 equations, 6 figures, 9 tables.

Introduction
Related Work
Facial AU Detection with Self-Attention
Facial AU Detection with Causal Inference
Methodology
Overview
Adaptive Constraining on Self-Attention
Causal Deconfounding of Sample Confounder
Experiments
Datasets and Settings
Datasets
Implementation Details
Evaluation Metrics
Comparison with State-of-the-Art Methods
Evaluation on BP4D
...and 13 more sections

Figures (6)

Figure 1: Illustration of AU correlations and self-attention weight distribution on sample images from Aff-Wild2 [kollias2019expression, kollias2021analysing] with the same happy expression. In (a), the AU co-occurrences include the common co-occurrence of AU 10 (upper lip raiser), AU 12 (lip corner puller), and AU 25 (lips part) across samples, as well as sample-specific AU co-occurrences. In (b), we visualize the average self-attention weight distribution of example AUs 10, 12, and 25 for our method without and with constraining self-attention. The self-attention weight distribution is visualized as a spatial distribution, in which attention weights are overlaid on the sample image for better viewing.
Figure 2: The architecture of our AC$^{2}$D framework, which uses a simplified structure of ResTv2 [zhang2022rest]. The $i$-th sample image in the training set first goes through a stem module and two stages to obtain a rich feature, which is then shared by $m$ branches, each predicting the occurrence probability $\hat{p}_{i}^{(j)}$ of one AU. Each AU branch applies a constraint to the self-attention weight distribution of an intermediate block in the third stage via an attention regression loss $\mathcal{L}_a$, and then uses causal intervention to deconfound the sample confounder in the AU feature $\mathbf{f}_i^{(j)}$ under the guidance of the AU detection loss $\mathcal{L}_u$. The formula $c'\times l'\times l'$ attached to each module denotes the size of its output, and $\times n$ denotes replicating the structure $n$ times. "$\star$" and $+$ denote element-wise multiplication and element-wise addition, respectively.
Figure 3: Definition of the locations of AU sub-centers, which is applicable to an aligned face with eye centers on the same horizontal line [li2018eac, shao2021jaa]. Each AU has two sub-centers specified by two facial landmarks due to facial symmetry. The red dotted line denotes the distance between the two inner eye corners, i.e., the "scale".
Figure 4: Illustration of our causal diagram for each AU. (a) The conventional likelihood $P(Y^{(j)}|X)$. (b) The likelihood $P(Y^{(j)}|do(X))$ after causal intervention.
Figure 5: Visualization of the self-attention $\mathbf{A}_i^{(j)}$ learned by our AC$^{2}$D, in terms of the average $\mathbf{A}_i^{avg(j)}$ and four example channels, for two sample images from Aff-Wild2 [kollias2019expression, kollias2021analysing]. For each sample image, the first row shows $\mathbf{A}_i^{(j)}$ and the next four rows show randomly selected example channels. To observe the variations across samples, the two images show the same example channels. Attention weights are overlaid on the sample image for better viewing.
...and 1 more figures
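
The contrast in Figure 4 between the conventional likelihood $P(Y^{(j)}|X)$ and the intervened likelihood $P(Y^{(j)}|do(X))$ can be made concrete with a toy calculation. The captions here do not give the paper's intervention formula, so this sketch assumes the standard backdoor adjustment, $P(Y|do(X)) = \sum_s P(Y|X,s)\,P(s)$, with a single binary confounder $S$ standing in for the sample confounder. All probabilities below are made-up illustrative numbers, not results from the paper.

```python
import numpy as np

# Hypothetical binary confounder S (e.g. which group a training sample comes from).
p_s = np.array([0.5, 0.5])              # P(S = s)
p_x1_given_s = np.array([0.9, 0.1])     # P(X = 1 | S = s): S strongly influences X
p_y1_given_x1_s = np.array([0.8, 0.4])  # P(Y = 1 | X = 1, S = s)

# Conventional likelihood: P(Y=1 | X=1) = sum_s P(Y=1 | X=1, s) * P(s | X=1).
# Conditioning on X shifts the distribution of S, which is the source of the bias.
p_s_given_x1 = p_x1_given_s * p_s / np.sum(p_x1_given_s * p_s)
p_obs = float(np.sum(p_y1_given_x1_s * p_s_given_x1))

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_s P(Y=1 | X=1, s) * P(s).
# Intervening on X cuts the S -> X edge, so S keeps its prior distribution.
p_do = float(np.sum(p_y1_given_x1_s * p_s))

print(p_obs, p_do)
```

With these numbers the observational estimate is larger than the interventional one, because conditioning on $X$ over-represents the confounder value that also favors $Y$.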
