ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning

Azmine Toushik Wasi; Karlo Šerbetar; Raima Islam; Taki Hasan Rafi; Dong-Kyu Chae

TL;DR

ARBEx addresses biased and uncertain labeling in facial expression learning by integrating a window-based cross-attention Vision Transformer with reliability balancing. The framework introduces trainable anchors and multi-head self-attention to correct labels, producing a final corrected distribution that stabilizes predictions. Experimental results across five FEL datasets show ARBEx surpasses prior SOTA methods, confirming the effectiveness of reliability balancing and anchor-based corrections in challenging, real-world data. The approach offers a scalable, robust solution for FEL that can integrate with various deep architectures and data pipelines, reducing bias and improving generalization in practical applications.

Abstract

In this paper, we introduce ARBEx, a novel attentive feature extraction framework driven by a Vision Transformer with reliability balancing, designed to cope with poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We combine several data pre-processing and refinement methods with a window-based cross-attention ViT to get the most out of the data. We also employ learnable anchor points in the embedding space, together with label distributions and a multi-head self-attention mechanism, to improve performance against weak predictions. This reliability balancing strategy leverages anchor points, attention scores, and confidence values to make label predictions more resilient. To ensure correct label classification and improve the model's discriminative power, we introduce an anchor loss, which encourages large margins between anchor points. The multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. Together, these elements improve the reliability of predictions and have a substantial positive effect on final prediction capabilities. Our adaptive model can be integrated with any deep neural network to address challenges in various recognition tasks. According to extensive experiments conducted in a variety of contexts, our strategy outperforms current state-of-the-art methods.
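The abstract says the anchor loss "encourages large margins between anchor points" but does not give its formula here. As a minimal sketch of that idea only, the snippet below uses a hinge-style pairwise penalty. The function name `pairwise_margin_loss`, the `margin` parameter, and the hinge form are all assumptions for illustration, not the paper's definition.

```python
import math
from itertools import combinations


def pairwise_margin_loss(anchors, margin=1.0):
    """Hypothetical anchor-separation penalty (not the paper's exact loss).

    Each pair of anchors closer than `margin` (Euclidean distance)
    contributes `margin - distance`; pairs that are already at least
    `margin` apart contribute nothing. The result is the mean over pairs.
    """
    pairs = list(combinations(anchors, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += max(0.0, margin - math.dist(a, b))
    return total / len(pairs)


# Two anchors half a unit apart are penalized; two anchors five units apart are not.
close_loss = pairwise_margin_loss([(0.0, 0.0), (0.0, 0.5)], margin=1.0)
far_loss = pairwise_margin_loss([(0.0, 0.0), (3.0, 4.0)], margin=1.0)
```

Minimizing a penalty of this kind pushes anchors apart until each pair is separated by at least the margin, which is one simple way to realize "large margins" in an embedding space.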

Paper Structure (30 sections, 26 equations, 7 figures, 5 tables)

Introduction
Our Contributions
Related Works
Facial Expression Learning (FEL)
Transformers in FEL
Uncertainty in FEL
Attention Networks in FEL
Approach
Problem Formulation
Window-Based Cross-Attention ViT
Reliability Balancing
Label Correction
Anchor Label Correction
Attentive Correction
Final Label Correction
...and 15 more sections

Figures (7)

Figure 1: A synopsis of ARBEx. Feature Extraction provides feature maps to generate initial predictions. Confidence distributions of initial labels are mostly inconsistent, unstable, and unreliable. The Reliability Balancing approach helps stabilize the distributions and address inconsistent and unreliable labeling.
Figure 2: Pipeline of ARBEx. Heavy Augmentation is applied to the input images, and the Data Refinement method selects a training batch with properly distributed classes for each epoch. The Window-Based Cross-Attention ViT framework uses multi-level feature extraction and integration to provide embeddings (Feature Vectors). A Linear Reduction Layer reduces the feature vector size for fast modeling. An MLP predicts the primary labels, and Confidence is calculated from the label distribution. Reliability balancing receives the embeddings and processes them in two ways. First, it places anchors in the embedding space and improves prediction probabilities by using these trainable anchors to search for similarities in the embedding space. Second, multi-head self-attention values are used to calculate a label correction and confidence. A weighted average of these two is used to calculate the final label correction. Using the label correction, the primary label distribution, and the confidence, the final corrected label distribution is calculated, making the model more reliable.
Figure 3: Data flow in the Window-Based Cross-Attention ViT network
Figure 4: Examples of training samples in different datasets
Figure 5: Observation of confidence probability distributions in ARBEx using the Aff-Wild2 dataset. The columns under each image represent eight emotions in order: Neutral, Anger, Fear, Disgust, Happiness, Sadness, Surprise, and Other. Primary Distribution (PD) is the initial prediction, while Corrected Distribution (CD) is the corrected prediction after Reliability Balancing. The correct label after reliability balancing is marked in green, and the inaccurate primary prediction label is marked in yellow.
...and 2 more figures
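
The Figure 2 caption describes reliability balancing as two correction paths (anchor similarity and attention) merged by a weighted average, then combined with the primary distribution and a confidence value. The sketch below illustrates that data flow under loud assumptions: the distance-based softmax weighting, the `mix` weight, and the confidence-weighted blend are hypothetical choices, and the attentive correction is simply passed in as an input rather than computed. None of these formulas are taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_correction(embedding, anchors, anchor_labels):
    """Assumed form: similarity-weighted vote over anchor label distributions.

    Anchors nearer to the embedding (Euclidean distance) get larger weights.
    """
    weights = softmax([-math.dist(embedding, a) for a in anchors])
    n_classes = len(anchor_labels[0])
    return [
        sum(w * labels[c] for w, labels in zip(weights, anchor_labels))
        for c in range(n_classes)
    ]


def final_distribution(primary, anchor_corr, attn_corr, confidence, mix=0.5):
    """Assumed blend: average the two corrections, then trust the primary
    prediction in proportion to `confidence` (a value in [0, 1])."""
    correction = [mix * a + (1.0 - mix) * t for a, t in zip(anchor_corr, attn_corr)]
    blended = [
        confidence * p + (1.0 - confidence) * c
        for p, c in zip(primary, correction)
    ]
    total = sum(blended)
    return [b / total for b in blended]


# Two-class toy example: a weak primary prediction is pulled toward class 0.
corr = anchor_correction((0.0, 0.0), [(0.0, 0.0), (10.0, 10.0)], [[1.0, 0.0], [0.0, 1.0]])
final = final_distribution([0.4, 0.6], [0.9, 0.1], [0.7, 0.3], confidence=0.6)
```

In this toy blend, a low confidence value lets the corrections dominate, while a confidence of 1.0 leaves the primary distribution unchanged.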
