
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger

TL;DR

This work tackles the lack of interpretability in state-of-the-art facial expression recognition (FER) by introducing a guided interpretable FER framework that leverages spatial action unit (AU) cues. AU heatmaps are constructed from an AU codebook, facial landmarks, and the image’s expression label, and are used to supervise layer-wise spatial attention through a cosine-based alignment loss during training. The approach is model-agnostic and requires only image-level supervision, with no extra manual annotations or architectural changes. It is validated on RAF-DB and AffectNet, where layer-wise interpretability improves without degrading classification performance. The method also improves the interpretability of class activation mapping (CAM) based classifiers, making FER decisions more aligned with expert knowledge and potentially more trustworthy in clinical and applied settings.

Abstract

Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, enabling the training of deep interpretable models. During training, this AU codebook is used, along with the input image expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest with respect to the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on the image class expression for supervision, without additional manual annotations. Our new strategy is generic and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, the RAF-DB and AffectNet datasets, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
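
The composite objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the channel-mean attention, the `1 - cosine` form of the alignment term, the weight name `lam`, and all function names are assumptions made for illustration, and the AU map is assumed to be already resized to the spatial size of the layer features.

```python
import math

def channel_mean_attention(features):
    """Collapse C x H x W features (nested lists) into an H x W attention map.
    Averaging over channels is an assumption, not necessarily the paper's choice."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    return [[sum(features[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two H x W maps, flattened to vectors."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b + eps)

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def composite_loss(logits, label, features, au_map, lam=1.0):
    """Classification loss plus a weighted attention-to-AU alignment term (assumed form)."""
    attention = channel_mean_attention(features)
    alignment = 1.0 - cosine_similarity(attention, au_map)
    return cross_entropy(logits, label) + lam * alignment

# Toy example: two channels of 2 x 2 features that respond at the top-right cell.
features = [[[0.0, 1.0], [0.0, 0.0]],
            [[0.0, 3.0], [0.0, 0.0]]]
aligned_au_map = [[0.0, 1.0], [0.0, 0.0]]     # AU region matches the attention
misaligned_au_map = [[1.0, 0.0], [0.0, 0.0]]  # AU region elsewhere

loss_aligned = composite_loss([0.0, 0.0], 0, features, aligned_au_map)
loss_misaligned = composite_loss([0.0, 0.0], 0, features, misaligned_au_map)
```

With identical logits, the loss is lower when the attention overlaps the AU map, which is the behavior the alignment term is meant to encourage.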

Paper Structure (17 sections, 3 equations, 11 figures, 3 tables)


Figures (11)

  • Figure 1: Comparison of class activation mapping (CAM) and attention maps produced at inference time using a FER classifier trained without (top) and with (bottom, ours) AU maps. In our experiments, we use an identical architecture to compare the impact of training with and without AU maps. The only difference is an additional AU-based training loss that uses AU maps. Training a FER classifier with our AU maps yields attention and CAM that are aligned with the expert knowledge used to assess basic facial expressions in images [Martinez19], as illustrated in the AU map ${\bm{A}}$. Consequently, our approach allows training a classifier that provides reliable interpretability without compromising classification accuracy. Details of our training strategy are presented in Fig. 3. Note that in CAM-based models, the classification head can be fully convolutional or standard fully connected layers pooling posterior probabilities.
  • Figure 2: Codebook of basic facial expressions and their associated AUs [Martinez19]. The spatial AU map is built using the image expression to select the corresponding AU subset, in combination with facial landmarks, which are employed to localize these same AUs in the image. In particular, the location of landmarks is used to estimate AU positions. For instance, the right 'Cheek' location is estimated using landmarks 47 (middle of the lower right eye) and 11 (right side of the jaw). The code 'AUx' is the identifier of the AU [Martinez19].
  • Figure 3: Our interpretable classifier for the FER task (training and inference). Each basic facial expression can be determined via a set of AUs [Martinez19]. Therefore, to train our interpretable FER classifier, we first extract facial landmarks and build a discriminative spatial map ${\bm{A}}$ that contains the set of all AUs associated with the image class expression [Martinez19]. This map is used as a localization cue to train layer-wise attention ${\bm{T}_l}$ to focus on the ROIs highlighted in the AU map. A classification loss, such as cross-entropy, is also used. Once trained, the classifier yields an interpretable layer-wise attention map. When a CAM method [rony2023deep] is considered, the classifier can also produce a per-class interpretable map.
  • Figure 4: Illustration of interpretability prediction over RAF-DB test samples using the CAM method [zhou2016learning] with and without action unit alignment. From left to right: input image, true action unit map ${\bm{A}}$, CAM ${\bm{M}[k]}$, attention ${\bm{T}_5}$.
  • Figure 5: Ablations on the RAF-DB test set: impact of ${\lambda}$ on classification and localization (interpretability) performance. Alignment to AUs is performed over layer 5 of ResNet50 [heZRS16] with the CAM method [zhou2016learning].
  • ...and 6 more figures
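
The AU map construction outlined in the Figure 2 and Figure 3 captions (select the AUs for the image's expression, localize them with landmarks, and build a spatial map) can be sketched as follows. This is only an illustration under stated assumptions: the codebook fragment, the mapping from AUs to landmark pairs, the use of a midpoint between two landmarks, the Gaussian shape, and all names (`CODEBOOK`, `AU_ANCHORS`, `build_au_map`, `sigma`) are hypothetical and are not taken from the paper. Only the cheek pair (47, 11) comes from the caption, and its association with AU6 is assumed.

```python
import math

# Illustrative codebook fragment; the paper's codebook [Martinez19] may differ.
CODEBOOK = {"happiness": ["AU6", "AU12"]}

# Hypothetical anchors: each AU region is centered at the midpoint of a landmark pair.
AU_ANCHORS = {
    "AU6": [(47, 11)],             # right cheek pair from the Figure 2 caption
    "AU12": [(48, 48), (54, 54)],  # assumed mouth-corner landmarks
}

def build_au_map(expression, landmarks, height, width, sigma=2.0):
    """Return an H x W heatmap with one Gaussian blob per AU region (assumed shape)."""
    heat = [[0.0] * width for _ in range(height)]
    for au in CODEBOOK[expression]:
        for i, j in AU_ANCHORS[au]:
            # Region center: midpoint of the two landmarks (an assumption).
            cx = (landmarks[i][0] + landmarks[j][0]) / 2.0
            cy = (landmarks[i][1] + landmarks[j][1]) / 2.0
            for y in range(height):
                for x in range(width):
                    d2 = (x - cx) ** 2 + (y - cy) ** 2
                    g = math.exp(-d2 / (2.0 * sigma ** 2))
                    # Keep the strongest response where regions overlap.
                    heat[y][x] = max(heat[y][x], g)
    return heat

# Toy landmark coordinates (x, y) on a 24 x 24 grid; values are made up.
landmarks = {47: (10, 8), 11: (14, 16), 48: (6, 20), 54: (18, 20)}
au_map = build_au_map("happiness", landmarks, height=24, width=24)
```

Because the map depends only on the expression label and detected landmarks, it requires no annotation beyond the image-level class, consistent with the supervision setting described in the abstract.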