Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
G Rajasekhar, Jahangir Alam
TL;DR
This work tackles dimensional audio-visual emotion recognition by addressing weak inter-modal complementarity that can degrade cross-attention. It introduces Inconsistency-Aware Cross-Attention (IACA), a two-stage gating mechanism that adaptively switches between cross-attended and self-attended features and then fuses modalities while downweighting corrupted information. Empirical results on Aff-Wild2 show that IACA consistently improves CA-based fusion methods, achieving high concordance correlation coefficients and robustness to missing audio. The approach advances multimodal fusion by making cross-modal interactions resilient to inconsistency, with practical impact for robust emotion inference in unconstrained settings.
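For readers unfamiliar with the metric named above, the concordance correlation coefficient (CCC) between predictions and labels can be computed as in the following minimal sketch. The function name `concordance_cc` is a hypothetical helper, not taken from the paper, and the paper's exact evaluation protocol (for example, per-video versus pooled computation) is not specified here.

```python
import numpy as np

def concordance_cc(pred, target):
    """Concordance correlation coefficient between two 1-D sequences.

    CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))**2)
    It is 1 for perfect agreement and penalizes both low correlation
    and shifts in mean or scale. Undefined (division by zero) when both
    inputs are constant and equal.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    # Population (biased) variance and covariance, as in the usual definition.
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Unlike Pearson correlation, this measure drops below 1 when predictions are perfectly correlated with the labels but offset from them, which is why it is commonly preferred for continuous emotion dimensions.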
Abstract
Leveraging complementary relationships across modalities has recently drawn considerable attention in multimodal emotion recognition. Most existing approaches use cross-attention to capture these relationships. However, the modalities may also exhibit weak complementary relationships, which can degrade the cross-attended features and result in poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which adaptively selects the most relevant features on-the-fly depending on whether the complementary relationships across the audio and visual modalities are strong or weak. Specifically, we design a two-stage gating mechanism that selects the appropriate features when complementary relationships are weak. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate the robustness of the proposed model.
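To make the idea of two-stage gating concrete, here is a minimal NumPy sketch under stated assumptions. It is an illustration only, not the authors' implementation: the scoring vectors `w_a`, `w_v`, `w_f`, the softmax temperature, and the function names `gate` and `two_stage_gating` are all hypothetical. Stage one softly chooses, per time step and per modality, between self-attended and cross-attended features; stage two softly weights the two gated modalities when fusing them.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate(candidates, weight, temperature=0.1):
    """Softly select among K candidate feature sequences.

    candidates: list of K arrays, each of shape (T, D)
    weight:     hypothetical scoring vector of shape (D, 1)
    Returns the gated features (T, D) and gate probabilities (T, K).
    """
    stacked = np.stack(candidates, axis=1)        # (T, K, D)
    scores = (stacked @ weight).squeeze(-1)       # (T, K)
    probs = softmax(scores / temperature, axis=-1)
    gated = (probs[..., None] * stacked).sum(axis=1)
    return gated, probs

def two_stage_gating(a_self, a_cross, v_self, v_cross, w_a, w_v, w_f):
    # Stage 1 (assumed form): per modality, choose between the
    # self-attended and the cross-attended representation.
    audio, p_a = gate([a_self, a_cross], w_a)
    visual, p_v = gate([v_self, v_cross], w_v)
    # Stage 2 (assumed form): weight the two modalities before fusion,
    # so an unreliable modality can receive a small weight.
    fused, p_f = gate([audio, visual], w_f)
    return fused, (p_a, p_v, p_f)

# Usage with random placeholder features (T time steps, D dimensions).
rng = np.random.default_rng(0)
T, D = 4, 8
feats = [rng.standard_normal((T, D)) for _ in range(4)]
w_a, w_v, w_f = (rng.standard_normal((D, 1)) for _ in range(3))
fused, (p_a, p_v, p_f) = two_stage_gating(*feats, w_a, w_v, w_f)
```

A low temperature pushes each gate toward a near-hard selection, while a higher one blends the candidates. In a trained model the scoring vectors would be learned jointly with the attention layers rather than sampled at random as above.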
