Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

Kang Shen, Xuxiong Liu, Boyan Wang, Jun Yao, Xin Liu, Yujie Guan, Yu Wang, Gengchen Li, Xiao Sun

TL;DR

The paper tackles affective behavior analysis in the wild for the ABAW7 challenge, covering valence-arousal (VA) estimation, expression (Expr) recognition, and action unit (AU) detection. It proposes a multi-architecture visual feature extraction pipeline (ResNet-18, POSTER, POSTER2, FAU) whose outputs are aligned to a common dimension by an affine module and fused by a Transformer Encoder to capture temporal and inter-feature interactions. Each task uses its own loss (MSE/CCC for VA, cross-entropy for Expr, Weighted Asymmetric Loss for AU), with an integrated learning strategy across sub-datasets. On the s-Aff-Wild2 dataset the method improves substantially over the baselines; the ResNet-18+POSTER2+FAU fusion performs strongly across VA, Expr, and AU, which suggests that diverse, well-aligned features are an advantage.
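
To make the CCC term in the VA loss concrete, here is a minimal sketch of the concordance correlation coefficient in plain Python. The function names `ccc` and `ccc_loss` are illustrative, and the `1 - CCC` form is a common convention assumed here; this summary does not say how the paper weights or combines MSE and CCC.

```python
from statistics import fmean


def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences.

    Uses population (biased) variance and covariance.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("pred and target need the same length (at least 2)")
    mean_p, mean_t = fmean(pred), fmean(target)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal: treat as perfect agreement.
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(pred, target):
    """Assumed loss form: 1 - CCC, so perfect agreement gives zero loss."""
    return 1.0 - ccc(pred, target)


# A constant offset lowers CCC even though the correlation is perfect.
print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 4/7, about 0.571
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale, which is why a prediction shifted by a constant scores below 1 in the example.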

Abstract

In this paper, we present our approach to addressing the challenges of the 7th ABAW competition. The competition comprises three sub-challenges: Valence Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these challenges, we employ state-of-the-art models to extract powerful visual features. Subsequently, a Transformer Encoder is utilized to integrate these features for the VA, Expr, and AU sub-challenges. To mitigate the impact of varying feature dimensions, we introduce an affine module to align the features to a common dimension. Overall, our results significantly outperform the baselines.
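
The affine alignment step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `AffineAlign`, the feature widths (512, 768, 256), the shared width 128, and the random initialization are all hypothetical, and the Transformer Encoder that consumes the aligned features is left out.

```python
import numpy as np


class AffineAlign:
    """One affine map y = x @ W + b per extractor, onto a shared width."""

    def __init__(self, in_dims, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical initialization: scaled Gaussian weights, zero biases.
        self.weights = {
            name: rng.normal(0.0, dim ** -0.5, size=(dim, d_model))
            for name, dim in in_dims.items()
        }
        self.biases = {name: np.zeros(d_model) for name in in_dims}

    def __call__(self, features):
        # features[name] has shape (frames, in_dim) for that extractor.
        aligned = [
            features[name] @ self.weights[name] + self.biases[name]
            for name in self.weights
        ]
        # After alignment every array is (frames, d_model), so they stack.
        return np.stack(aligned, axis=0)


# Hypothetical per-frame feature widths for three extractors.
in_dims = {"resnet18": 512, "poster2": 768, "fau": 256}
frames = 16
rng = np.random.default_rng(1)
features = {name: rng.normal(size=(frames, dim)) for name, dim in in_dims.items()}

align = AffineAlign(in_dims, d_model=128)
tokens = align(features)
print(tokens.shape)  # (3, 16, 128)
```

Once every extractor's output has the same width, the features can be combined (for example stacked, summed, or concatenated along the sequence) and passed to a Transformer Encoder; which combination the paper uses is not stated in this summary.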

Paper Structure

This paper contains 17 sections, 7 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: The overall framework of our proposed method. The visual extractors include EAC, ResNet18, POSTER, etc. The design of the Transformer Encoder follows Vaswani et al. (2017).