Batch Transformer: Look for Attention in Batch

Myung Beom Her; Jisu Jeong; Hojoon Song; Ji-Hyeong Han

TL;DR

This work tackles facial expression recognition (FER) in the wild by addressing the uncertainty of single-image cues. It introduces a Batch Transformer Network (BTN) that combines Class Batch Attention (CBA) and Multi-Level Attention (MLA) to leverage information across a batch and across semantic feature levels, respectively, and uses a Batch Transformer (BT) to enforce batch-consistent predictions during training. BTN integrates two bottom-up branches (image features via IR50-ArcFace and landmark features via MobileFaceNet) with MLA and CBA, culminating in a ViT-based classifier, and is optimized with a composite loss $L = \lambda L_{VIT} + L_{BT} + L_{CBA}$. Empirically, BTN achieves state-of-the-art results on RAF-DB and AffectNet, with improved robustness to occlusion, low resolution, pose variation, and illumination variation, which highlights the practical value of batch-aware attention for FER in the wild.
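The composite loss above can be illustrated with a minimal sketch. The summary does not say which loss each term uses, so this example assumes all three terms are mean cross-entropy losses over the batch, applied to three sets of class logits; the function names `cross_entropy` and `composite_loss`, the argument names, and the default `lam=1.0` are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean cross-entropy over a batch (assumed form of each loss term)."""
    losses = [-math.log(softmax(row)[y]) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def composite_loss(logits_vit, logits_bt, logits_cba, labels, lam=1.0):
    """L = lam * L_VIT + L_BT + L_CBA, with each term assumed to be cross-entropy."""
    l_vit = cross_entropy(logits_vit, labels)
    l_bt = cross_entropy(logits_bt, labels)
    l_cba = cross_entropy(logits_cba, labels)
    return lam * l_vit + l_bt + l_cba


# Two images, two classes, uninformative (all-zero) logits for every head.
uniform = [[0.0, 0.0], [0.0, 0.0]]
loss = composite_loss(uniform, uniform, uniform, labels=[0, 1], lam=2.0)
```

With uniform logits every cross-entropy term equals $\ln 2$, so only the weighting by $\lambda$ distinguishes the ViT term from the other two.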

Abstract

Facial expression recognition (FER) has received considerable attention in computer vision, particularly in "in-the-wild" settings such as human-computer interaction. However, FER images contain uncertainties such as occlusion, low resolution, pose variation, illumination variation, and subjectivity, meaning that some expressions do not match the target label. Consequently, a single noisy image yields little information, and that information cannot be trusted. This can significantly degrade the performance of the FER task. To address this issue, we propose a batch transformer (BT), which consists of the proposed class batch attention (CBA) module, to prevent overfitting to noisy data and to extract trustworthy information by training on features reflected from several images in a batch, rather than on information from a single image. We also propose multi-level attention (MLA) to prevent overfitting to specific features by capturing correlations between levels. In this paper, we present a batch transformer network (BTN) that combines the above proposals. Experimental results on various FER benchmark datasets show that the proposed BTN consistently outperforms the state-of-the-art on FER datasets. Representative results demonstrate the promise of the proposed BTN for FER.

Paper Structure (14 sections, 5 equations, 8 figures, 8 tables)

Introduction
Related Work
Facial Expression Recognition (FER)
Vision Transformers
Method
Batch Transformer Network
Batch Transformer
Experiment
Datasets
Implementation Details
Comparison with the State-of-the-Art Methods
Ablation Studies
Visualization
Conclusion

Figures (8)

Figure 1: (a) shows multi-head self attention (MHSA), in which samples cannot affect one another during training. (b) shows class batch attention (CBA), in which samples influence one another. It provides trustworthy information by reflecting the class predictions of several images with similar features in a batch. We start from it to build the batch transformer.
Figure 2: An overview of the batch transformer network (BTN) and its sub-network architecture. Two pre-trained networks are employed for training BTN: IR50, pretrained on MS-Celeb-1M, extracts image features, and a frozen MobileFaceNet, also pretrained on MS-Celeb-1M, extracts landmark features. The features at each semantic level of the two backbones are forwarded to multi-head cross attention to capture attention for landmark features in image features. The captured features $S_l, l=1,2,3$ are forwarded to multi-level attention (MLA). After $F_{MLA}$ is embedded into $E$, it is forwarded to the batch transformer (BT). $N$ denotes the number of emotion labels.
Figure 3: Structure of the batch transformer. The feature map $F_{MLA}$ is embedded into $E$ using a convolution layer. The embedded features $E$ are channel-positional encoded to obtain the same position per channel. The channel-positional encoded features and the class prediction, $output_{VIT}$, are then forwarded to class batch attention (CBA) to reflect the class predictions of several images with similar features. $P_{CBA}$ is added to $P_{VIT}$ to fuse the features of a single image with the features of several images. $F_{i}^{j}$ and $P_{k}^{j}$ denote the $i^{\text{th}}$ channel feature map of the $j^{\text{th}}$ image in a batch and the $k^{\text{th}}$ class prediction of the $j^{\text{th}}$ image in the batch, respectively. $B$ and $N$ are the number of images in the batch and the number of classes, respectively.
Figure 4: Predicted probability distributions with and without the batch transformer on RAF-DB, focusing on hard cases such as occlusion, low resolution, pose variation, and illumination variation. Ground truth labels and correct predictions are marked in red.
Figure 5: Visualization of the activation map generated by Score-CAM for comparing the proposed BTN with POSTER++ in each semantic level.
...and 3 more figures
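
The idea described in the captions of Figures 1 and 3, where each image's class prediction is informed by the predictions of images with similar features in the same batch, can be sketched as attention over the batch dimension. This is an assumption-heavy illustration, not the paper's implementation: it uses plain scaled dot-product attention with no learned projections, treats one flattened feature vector per image as both query and key, and uses the per-image class predictions as values. The names `class_batch_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_batch_attention(features, preds):
    """Mix class predictions across a batch by feature similarity.

    features: B feature vectors of length d (assumed queries and keys).
    preds:    B class-prediction vectors of length N (assumed values).
    Returns B mixed prediction vectors of length N.
    """
    d = len(features[0])
    num_classes = len(preds[0])
    mixed = []
    for query in features:
        # Similarity of this image to every image in the batch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in features
        ]
        weights = softmax(scores)
        # Weighted average of the batch's class predictions.
        mixed.append([
            sum(w * p[c] for w, p in zip(weights, preds))
            for c in range(num_classes)
        ])
    return mixed


def fuse(p_vit, p_cba):
    """Element-wise sum of single-image and batch-level predictions."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(p_vit, p_cba)]


# Two images with identical features but conflicting predictions.
feats = [[1.0, 0.0], [1.0, 0.0]]
p_vit = [[1.0, 0.0], [0.0, 1.0]]
p_cba = class_batch_attention(feats, p_vit)
p_fused = fuse(p_vit, p_cba)
```

Because the two feature vectors are identical, the attention weights are uniform and both images receive the batch-average prediction, which is then added to each image's own prediction.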
