Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu; Naixiang Zheng; Guoyuan Wang; Yunxiang Zhang; Lingsi Zhu; Jiaen Liang; Wei Huang; Shengping Liu

TL;DR

This work proposes a multimodal framework that dynamically fuses visual and audio representations to handle missing modalities and complex spatiotemporal dependencies in the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge.

Abstract

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
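Two of the components named in the abstract, focal loss and sliding-window soft voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the focusing parameter `gamma=2.0`, the window size, and the centered-window averaging are all assumptions chosen for illustration, since the abstract does not give those details.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one frame: -(1 - p_t)^gamma * log(p_t).

    probs  : list of class probabilities for the frame (sums to 1)
    target : index of the ground-truth class
    gamma  : focusing parameter (assumed value; gamma=0 gives cross-entropy)
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def sliding_window_soft_vote(frame_probs, window=5):
    """Average class probabilities over a centered window, then take argmax.

    frame_probs : list of per-frame probability lists
    window      : number of frames in the window (assumed value)
    """
    n = len(frame_probs)
    half = window // 2
    labels = []
    for i in range(n):
        # Clip the window at the start and end of the sequence.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        num_classes = len(frame_probs[i])
        mean = [
            sum(frame_probs[j][c] for j in range(lo, hi)) / (hi - lo)
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=mean.__getitem__))
    return labels


# A single-frame flip to class 1 is smoothed away by its neighbours.
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.8, 0.2]]
smoothed = sliding_window_soft_vote(probs, window=3)
```

In this sketch, well-classified frames contribute less to the loss as `gamma` grows, which is the usual motivation for focal loss on long-tailed label distributions, and averaging probabilities rather than hard labels lets a confident neighbouring frame outweigh an uncertain one.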

Paper Structure

This paper contains 18 sections, 6 equations, 1 figure, 3 tables.

Introduction
Related Work
In-the-Wild Affective Research
EXPR in the ABAW Competition
Method
Overview
Feature Extraction and Pre-training
Multimodal Attention Network
Modality Dropout and Safe Attention Mechanism
Optimization Objective
Inference Strategy and Post-processing
Experiments
Datasets and Pre-training Strategy
Multimodal Feature Extraction
Baseline Design and Modality Weight Analysis
...and 3 more sections

Figures (1)

Figure 1: The proposed multimodal emotion recognition framework processes video and audio inputs through BEiT-Large and WavLM-Large, aligns them in a unified embedding space, and dynamically fuses the representations via a multimodal attention network for MLP classification.
