Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Joe Dhanith P R, Shravan Venkatraman, Vigya Sharma, Santhosh Malarvannan

TL;DR

The paper tackles multimodal emotion recognition by addressing temporal misalignment and weak feature representations through AVT-CA, a dual-stream Audio-Video Transformer with cross attention. It introduces a hierarchical video feature refinement that combines channel attention, spatial attention, and local feature extraction, followed by an intermediate transformer fusion and a cross-attention module to align audio-visual cues. Across RAVDESS, CMU-MOSEI, and CREMA-D, AVT-CA achieves state-of-the-art accuracy and F1-scores, demonstrating robust cross-modal integration and noise-aware fusion. The work provides publicly available code and outlines a clear path for extending MER with continual and self-supervised learning.

Abstract

Multimodal emotion recognition (MER) aims to infer human affect by jointly modeling audio and visual cues; however, existing approaches often struggle with temporal misalignment, weakly discriminative feature representations, and suboptimal fusion of heterogeneous modalities. To address these challenges, we propose AVT-CA, an Audio-Video Transformer architecture with cross attention for robust emotion recognition. The proposed model introduces a hierarchical video feature representation that combines channel attention, spatial attention, and local feature extraction to emphasize emotionally salient regions while suppressing irrelevant information. These refined visual features are integrated with audio representations through an intermediate transformer-based fusion mechanism that captures interlinked temporal dependencies across modalities. Furthermore, a cross-attention module selectively reinforces mutually consistent audio-visual cues, enabling effective feature selection and noise-aware fusion. Extensive experiments on three benchmark datasets, CMU-MOSEI, RAVDESS, and CREMA-D, demonstrate that AVT-CA consistently outperforms state-of-the-art baselines, achieving significant improvements in both accuracy and F1-score. Our source code is publicly available at https://github.com/shravan-18/AVTCA.
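
The cross-attention step described above can be illustrated with a minimal sketch of scaled dot-product attention in which queries come from one modality and keys and values come from the other. This is not the authors' implementation: it assumes identity projections (no learned query, key, or value weights), a single head, and made-up toy features, and the names `cross_attention`, `audio`, and `video` are hypothetical. It only shows why audio and video sequences of different lengths can be related without frame-level alignment.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: list of T_q vectors (one modality), each of size d
    keys, values: lists of T_k vectors (the other modality), each of size d
    Returns (outputs, weights); weights[i][j] is how strongly query step i
    attends to key step j.
    Assumption: no learned projections, single head (illustration only).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query step to every step of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted average of the other modality's value vectors.
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy features (made-up numbers): 3 audio steps and 2 video steps, dimension 4.
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
video = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]

# Audio queries attend over video, and video queries attend over audio.
audio_attended, a2v_weights = cross_attention(audio, video, video)
video_attended, v2a_weights = cross_attention(video, audio, audio)
```

Each audio step receives a weighted mix of video steps (and vice versa), so steps that agree across modalities are weighted more heavily than steps that do not. A full model would typically add learned projections, multiple heads, and residual connections around this core operation.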

Paper Structure

This paper contains 55 sections, 6 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of facial expressions from the CMU-MOSEI dataset.
  • Figure 2: Architecture of the proposed AVT-CA model. Audio and video inputs are processed by modality-specific convolutional blocks, refined through attention mechanisms, and fused via intermediate transformers and cross-attention to produce final emotion predictions.
  • Figure 3: Distribution of emotion categories across datasets and splits. Each row shows one benchmark (RAVDESS, CMU-MOSEI, CREMA-D), while columns report the label distribution for the full dataset, training set, and validation set. All datasets are approximately balanced across classes, indicating that models trained on these data are not biased toward a particular emotion.
  • Figure 4: Performance curves of AVT-CA on the RAVDESS dataset, illustrating accuracy, F1-Score, and loss behavior during training and validation.
  • Figure 5: Performance curves of AVT-CA on the CMU-MOSEI dataset, illustrating accuracy, F1-Score, and loss behavior during training and validation.
  • ...and 1 more figure