Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith P R, Shravan Venkatraman, Vigya Sharma, Santhosh Malarvannan
TL;DR
The paper addresses temporal misalignment and weakly discriminative feature representations in multimodal emotion recognition (MER) with AVT-CA, a dual-stream Audio-Video Transformer with cross attention. It introduces a hierarchical video feature refinement that combines channel attention, spatial attention, and local feature extraction, followed by an intermediate transformer fusion stage and a cross-attention module that aligns audio-visual cues. On RAVDESS, CMU-MOSEI, and CREMA-D, AVT-CA reports state-of-the-art accuracy and F1-scores, which the authors attribute to robust cross-modal integration and noise-aware fusion. The code is publicly available, and the paper outlines extensions of MER toward continual and self-supervised learning.
Abstract
Multimodal emotion recognition (MER) aims to infer human affect by jointly modeling audio and visual cues; however, existing approaches often struggle with temporal misalignment, weakly discriminative feature representations, and suboptimal fusion of heterogeneous modalities. To address these challenges, we propose AVT-CA, an Audio-Video Transformer architecture with cross attention for robust emotion recognition. The proposed model introduces a hierarchical video feature representation that combines channel attention, spatial attention, and local feature extraction to emphasize emotionally salient regions while suppressing irrelevant information. These refined visual features are integrated with audio representations through an intermediate transformer-based fusion mechanism that captures interlinked temporal dependencies across modalities. Furthermore, a cross-attention module selectively reinforces mutually consistent audio-visual cues, enabling effective feature selection and noise-aware fusion. Extensive experiments on three benchmark datasets, CMU-MOSEI, RAVDESS, and CREMA-D, demonstrate that AVT-CA consistently outperforms state-of-the-art baselines, achieving significant improvements in both accuracy and F1-score. Our source code is publicly available at https://github.com/shravan-18/AVTCA.
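To make the cross-attention idea concrete, the following is a minimal NumPy sketch of bidirectional audio-video cross attention using standard scaled dot-product attention. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `cross_attention`), the sequence lengths, the feature dimension, and the random projection matrices (stand-ins for learned weights) are all hypothetical, and the final mean-pool-and-concatenate step is only one plausible way to form a joint feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product attention where one modality queries the other.

    queries: (T_q, d) features of the attending modality.
    context: (T_c, d) features of the attended modality.
    Returns the attended features (T_q, d) and the weights (T_q, T_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights


# Hypothetical sizes: the two streams deliberately differ in length,
# since cross attention does not require frame-level alignment.
rng = np.random.default_rng(0)
T_AUDIO, T_VIDEO, D_MODEL = 6, 4, 8
audio = rng.standard_normal((T_AUDIO, D_MODEL))
video = rng.standard_normal((T_VIDEO, D_MODEL))

# Random projections stand in for learned parameters (assumption).
w_q, w_k, w_v = (rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(3))

# Audio attends to video, and video attends to audio.
audio_attended, a2v_weights = cross_attention(audio, video, w_q, w_k, w_v)
video_attended, v2a_weights = cross_attention(video, audio, w_q, w_k, w_v)

# One simple way to build a joint feature: mean-pool each stream over time
# and concatenate (an assumption, not necessarily the paper's fusion head).
fused = np.concatenate([audio_attended.mean(axis=0), video_attended.mean(axis=0)])
```

Because each attention row is a softmax over the other modality's time steps, frames in one stream that do not match any query in the other receive low weight, which is the sense in which cross attention can act as a soft feature selector across misaligned sequences.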
