Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition

G Rajasekhar, Jahangir Alam

TL;DR

This work tackles dimensional audio-visual emotion recognition by addressing weak inter-modal complementarity that can degrade cross-attention. It introduces Inconsistency-Aware Cross-Attention (IACA), a two-stage gating mechanism that adaptively switches between cross-attended and self-attended features and then fuses modalities while downweighting corrupted information. Empirical results on Aff-Wild2 show that IACA consistently improves CA-based fusion methods, achieving high concordance correlation coefficients and robustness to missing audio. The approach advances multimodal fusion by making cross-modal interactions resilient to inconsistency, with practical impact for robust emotion inference in unconstrained settings.

Abstract

Leveraging complementary relationships across modalities has recently drawn considerable attention in multimodal emotion recognition. Most existing approaches use cross-attention to capture these relationships. However, the modalities may also exhibit weak complementary relationships, which can degrade the cross-attended features and yield poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which adaptively selects the most relevant features on the fly, depending on whether the audio and visual modalities exhibit strong or weak complementary relationships. Specifically, we design a two-stage gating mechanism that adaptively selects the appropriate features when complementary relationships are weak. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate the robustness of the proposed model.
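
The two-stage gating idea described above can be illustrated with a rough sketch. This is not the authors' implementation: the summary gives no gate equations, so the softmax-with-temperature gate, the function `soft_gate`, the projection matrices `w_a`, `w_v`, `w_f`, and all tensor sizes below are assumptions chosen only to show how a soft switch between two feature streams might look.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_gate(first, second, weight, temperature=0.1):
    """Blend two (T, d) feature streams with per-frame gate weights.

    `weight` is a hypothetical (2d, 2) projection standing in for a learned
    gating layer; a temperature below 1 sharpens the gate toward a hard switch.
    """
    scores = np.concatenate([first, second], axis=-1) @ weight  # (T, 2)
    gate = softmax(scores / temperature, axis=-1)               # each row sums to 1
    return gate[:, :1] * first + gate[:, 1:] * second, gate

rng = np.random.default_rng(0)
T, d = 4, 8  # frames and feature dimension (illustrative sizes)

# Random stand-ins for the self-attended and cross-attended features.
a_self, a_cross = rng.normal(size=(T, d)), rng.normal(size=(T, d))
v_self, v_cross = rng.normal(size=(T, d)), rng.normal(size=(T, d))

# Stage 1 (assumed form): per modality, choose between cross- and self-attended features.
w_a, w_v = rng.normal(size=(2 * d, 2)), rng.normal(size=(2 * d, 2))
a_feat, a_gate = soft_gate(a_cross, a_self, w_a)
v_feat, v_gate = soft_gate(v_cross, v_self, w_v)

# Stage 2 (assumed form): fuse the modalities; the gate can downweight the less reliable one.
w_f = rng.normal(size=(2 * d, 2))
fused, f_gate = soft_gate(a_feat, v_feat, w_f)
print(fused.shape)
print(f_gate.round(2))
```

In a trained model the projections would be learned, so a frame with a blurred face could receive a gate weight close to zero for the visual stream; here the weights are random and the printed gates carry no meaning beyond showing the mechanics.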

Paper Structure (5 sections, 8 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Attention scores are normalized between 0 and 1. (a) Cross-attention scores for the subject "12-24-1920x1080" of the Aff-Wild2 dataset. Both modalities exhibit high attention scores because of their strong complementary relationship (both portray significant expressions). (b) Cross-attention scores for the subject "21-24-1920x1080" of the Aff-Wild2 dataset. Here the facial modality is corrupted by extreme pose and blur, while the vocal modality is uncorrupted and carries rich emotion. Attending the noisy face to the rich vocal expressions results in lower attention scores, so the rich vocal information is lost.
  • Figure 2: Illustration of the proposed Inconsistency-Aware Cross-Attention model.
  • Figure 3: CCC performance of the proposed model and RJCA as the proportion of missing audio increases.
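
The results are reported with the concordance correlation coefficient (CCC). As a minimal sketch of that metric, the standard definition is CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `ccc` and the sample values are illustrative, and the use of population (biased) variance is an assumption, since the summary does not state which estimator the paper uses.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    Uses population (ddof=0) variance and covariance, which is an assumption.
    Undefined (division by zero) when both inputs are the same constant.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = ((pred - mean_p) * (gold - mean_g)).mean()
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

gold = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative annotations, not dataset values
perfect = ccc(gold, gold)              # identical sequences agree fully
shifted = ccc(gold + 1.0, gold)        # a constant offset is penalized
print(perfect, shifted)
```

Unlike Pearson correlation, CCC drops when predictions are offset or rescaled relative to the annotations, which is why the shifted sequence scores below the identical one even though the two are perfectly correlated.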