Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition

R. Gnana Praveen; Jahangir Alam

TL;DR

Dynamic Cross-Attention (DCA) is proposed, which dynamically selects cross-attended or unattended features on the fly, depending on whether the audio and visual modalities exhibit a strong or weak complementary relationship, respectively.

Abstract

In video-based emotion recognition, audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, resulting in poor representations of audio-visual features, thus degrading the performance of the system. To address this issue, we propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly based on their strong or weak complementary relationship with each other, respectively. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when the modalities exhibit a strong complementary relationship, and unattended features otherwise. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare the proposed approach with other variants of cross-attention and show that the proposed model consistently improves the performance on both datasets.
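The gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the layer's exact form, so the two-way softmax gate, the low temperature, the final ReLU, and the names `dynamic_gate` and `gate_weights` are all assumptions made for illustration. The sketch scores the cross-attended features, turns the two scores into gate probabilities, and blends the unattended and cross-attended features accordingly.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_gate(unattended, attended, gate_weights, temperature=0.1):
    """Blend unattended and cross-attended features with a two-way soft gate.

    gate_weights is a hypothetical learned parameter: two weight vectors,
    the first scoring "keep unattended", the second "use cross-attended".
    """
    # Assumption: gate logits are computed from the cross-attended features.
    logits = [sum(w * x for w, x in zip(row, attended)) for row in gate_weights]
    # A low temperature pushes the gate towards a near-hard selection.
    g_unattended, g_attended = softmax(logits, temperature)
    # Weighted blend of both feature vectors, followed by a ReLU (assumed).
    fused = [
        max(0.0, g_unattended * u + g_attended * a)
        for u, a in zip(unattended, attended)
    ]
    return fused, (g_unattended, g_attended)


# Toy features for one modality (values are made up for illustration).
unattended = [0.2, 0.8, 0.5]
attended = [0.9, 0.1, 0.4]
gate_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

fused, gate = dynamic_gate(unattended, attended, gate_weights)
print(gate)   # gate probabilities for (unattended, cross-attended)
print(fused)  # gated feature vector
```

In this toy setting the gate strongly favours the unattended branch, so the fused vector stays close to the unattended features, which mirrors the weak-complementarity case the abstract describes. With different `gate_weights` the same function would pass the cross-attended features through instead.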

Paper Structure (5 sections, 4 equations, 3 figures, 3 tables)

Introduction
Related Work
Proposed Model
Results and Discussion
Conclusion

Figures (3)

Figure 1: (a) Attention scores based on cross-attention for the subject "12-24-1920x1080" from the validation set of the Aff-Wild2 dataset. Here, the audio and visual modalities strongly complement each other, so higher attention scores are assigned to the face and voice where expressions are significant. (b) Attention scores based on cross-attention for the subject "21-24-1920x1080" from the validation set of the Aff-Wild2 dataset. In this case, the facial modality is corrupted by extreme blur and pose, whereas the vocal modality carries rich expressions. Cross-attending the corrupted face with the rich vocal expressions fails to assign higher attention scores to the vocal expressions.
Figure 2: Illustration of the proposed Dynamic Cross-Attention (DCA) model with vanilla Cross-Attention (CA) as the baseline.
Figure 3: Visualization of the valence and arousal predictions for the subject "21-24-1920x1080" from the validation set of the Aff-Wild2 dataset.
