Dynamic Cross Attention for Audio-Visual Person Verification

R. Gnana Praveen, Jahangir Alam

TL;DR

This work addresses audio-visual person verification when the audio and visual modalities only weakly complement each other. It introduces Dynamic Cross-Attention (DCA), a gating-based extension of cross-attention. For each modality, a gating layer computes $Y_{go,a}=X_{att,a}^T W_{gl,a}$ and $Y_{go,v}=X_{att,v}^T W_{gl,v}$, then applies a softmax with temperature $T=0.1$ to obtain the gates $G_a$ and $G_v$, which decide whether fusion relies on the cross-attended or the unattended features. DCA plugs into existing CA frameworks and improves on both CA and Joint Cross-Attention (JCA) on VoxCeleb1, as shown through ablations and comparisons with state-of-the-art methods. The gating makes fusion more reliable when one modality is noisy or misaligned, and the authors suggest training on VoxCeleb2 as future work to improve generalization.
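The gating step summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature shapes (`d` feature dimensions by `L` time steps), the two-column gate weight `w_gl`, and the way the gate columns weight the unattended and cross-attended features are assumptions made for the example. Only the projection $Y_{go}=X_{att}^T W_{gl}$ and the softmax with $T=0.1$ come from the summary. The function name `dynamic_gate` is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_gate(x, x_att, w_gl, temperature=0.1):
    """Gate between unattended and cross-attended features of one modality.

    x      : unattended features, shape (d, L)      (assumed layout)
    x_att  : cross-attended features, shape (d, L)  (assumed layout)
    w_gl   : gating weights, shape (d, 2)           (assumed shape)
    Returns the gated features (d, L) and the gate (L, 2).
    """
    y_go = x_att.T @ w_gl                      # Y_go = X_att^T W_gl, shape (L, 2)
    g = softmax(y_go / temperature, axis=-1)   # low T pushes each row toward one-hot
    # Assumption: column 0 weights the unattended features and
    # column 1 weights the cross-attended features, per time step.
    gated = x * g[:, 0] + x_att * g[:, 1]
    return gated, g

# Toy audio example with random features and weights.
rng = np.random.default_rng(0)
d, L = 8, 5
x_a = rng.standard_normal((d, L))
x_att_a = rng.standard_normal((d, L))
w_gl_a = rng.standard_normal((d, 2))
fused_a, g_a = dynamic_gate(x_a, x_att_a, w_gl_a)
```

Because the temperature is small, dividing the logits by $T=0.1$ sharpens the softmax, so each row of the gate is close to a hard choice between the two feature sets. The same function would apply to the visual branch with its own weights.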

Abstract

Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves performance on multiple variants of cross-attention while outperforming state-of-the-art methods.

Paper Structure (5 sections, 6 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Attention scores based on cross-attention. In the top image, the audio and visual modalities strongly complement each other, so high attention scores are assigned to both face and voice. In the bottom image, the face is corrupted by blur while the speech signal is not corrupted. Cross-attending the corrupted face to the uncorrupted speech signal fails to assign high attention scores to the speech signal.
  • Figure 2: Illustration of the proposed Dynamic Cross-Attention (DCA) model with vanilla Cross-Attention (CA) as the baseline.