Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

R. Gnana Praveen; Eric Granger; Patrick Cardinal

TL;DR

This work introduces a cross-attentional audio-visual fusion framework for dimensional emotion recognition that uses a cross-correlation based attention mechanism to exploit inter-modal relationships between facial and vocal features. A visual network (I3D) and an audio network (a CNN operating on spectrograms) extract modality-specific features, which are fused by a cross-attention module that computes a cross-correlation matrix $\boldsymbol{Z}=\boldsymbol{X}_{\mathbf{a}}^{\top}\boldsymbol{W}\boldsymbol{X}_{\mathbf{v}}$, where $\boldsymbol{W}$ holds the cross-correlation weights, and derives attention maps for both modalities from it. The attended A-V representation is fed to fully connected layers that predict continuous valence and arousal. The model is trained with a loss based on the Concordance Correlation Coefficient (CCC), $\text{Loss}=1-\rho_c$, where $\rho_c$ denotes the CCC between predictions and ground truth. It is validated on the RECOLA and Fatigue datasets, where it improves over baseline fusion strategies. The results indicate that exploiting inter-modal cross-correlation in a lightweight cross-attention framework gives better performance at lower computational complexity than multi-stage attention, which makes the approach a candidate for real-world affective computing tasks.
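The CCC-based training objective mentioned above can be illustrated with a minimal sketch. It uses the standard definition $\rho_c = 2\sigma_{xy} / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$ on plain Python lists. The function names `ccc` and `ccc_loss`, and the handling of the degenerate zero-denominator case, are assumptions made for illustration and are not taken from the authors' code.

```python
def ccc(pred, target):
    """Concordance Correlation Coefficient of two equal-length sequences.

    Uses population (1/n) statistics:
    rho_c = 2*cov / (var_pred + var_target + (mean_pred - mean_target)**2)
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        # Both sequences are the same constant. Treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 2 * cov / denom


def ccc_loss(pred, target):
    """Loss = 1 - rho_c, so perfect agreement gives a loss of 0."""
    return 1.0 - ccc(pred, target)


if __name__ == "__main__":
    valence_true = [2.0, 3.0, 4.0]   # hypothetical annotations
    valence_pred = [1.0, 2.0, 3.0]   # hypothetical predictions, offset by -1
    print(ccc(valence_pred, valence_true), ccc_loss(valence_pred, valence_true))
```

In the hypothetical example, the predictions are perfectly correlated with the annotations but shifted by a constant, and the CCC drops to 4/7. This sensitivity to bias and scale, which plain Pearson correlation lacks, is a common reason for preferring CCC in valence and arousal regression.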

Abstract

Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most of the existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. Our new cross-attentional A-V fusion model efficiently leverages the inter-modal relationships. In particular, it computes cross-attention weights to focus on the more contributive features across individual modalities, and thereby combines contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) datasets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available at https://github.com/praveena2j/Cross-Attentional-AV-Fusion
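The cross-attention step described in the abstract can be sketched in plain Python, using the cross-correlation form $\boldsymbol{Z}=\boldsymbol{X}_{\mathbf{a}}^{\top}\boldsymbol{W}\boldsymbol{X}_{\mathbf{v}}$ from the TL;DR. Several details are assumptions made for illustration and may differ from the authors' implementation: both feature matrices are taken to be $K \times L$ ($K$ feature dimensions, $L$ time steps), the attention maps are column-wise softmaxes of $\boldsymbol{Z}$ and $\boldsymbol{Z}^{\top}$, and the attended features are concatenated along the feature axis. All function names are hypothetical, and a real model would use a tensor library with a learned $\boldsymbol{W}$.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_columns(m):
    """Apply a numerically stable softmax to each column (columns sum to 1)."""
    cols = []
    for col in zip(*m):
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)


def cross_attention_fuse(x_a, x_v, w):
    """Fuse audio and visual features with cross-correlation attention.

    x_a, x_v: K x L feature matrices; w: K x K weight matrix.
    Returns a 2K x L matrix: attended visual rows followed by audio rows.
    """
    z = matmul(matmul(transpose(x_a), w), x_v)  # L x L cross-correlation
    attn_a = softmax_columns(z)                 # attention map for audio
    attn_v = softmax_columns(transpose(z))      # attention map for visual
    att_a = matmul(x_a, attn_a)                 # K x L re-weighted audio
    att_v = matmul(x_v, attn_v)                 # K x L re-weighted visual
    return att_v + att_a                        # concatenate along features


if __name__ == "__main__":
    random.seed(0)
    K, L = 4, 6  # hypothetical sizes
    x_a = [[random.gauss(0, 1) for _ in range(L)] for _ in range(K)]
    x_v = [[random.gauss(0, 1) for _ in range(L)] for _ in range(K)]
    w = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(K)]
    fused = cross_attention_fuse(x_a, x_v, w)
    print(len(fused), len(fused[0]))  # number of rows and columns of the fused matrix
```

Because each attention column sums to one, every column of an attended feature matrix is a convex combination of that modality's original feature vectors, weighted by how strongly each time step correlates with the other modality. In the full model, the fused representation would then be passed to fully connected layers that regress valence and arousal.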
