Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention
R. Gnana Praveen, Jahangir Alam
TL;DR
This work tackles robust audio-visual person verification by introducing a recursive joint cross-attention (RJCA) mechanism that models both intra- and inter-modal relationships between faces and voices. The method forms a joint audio-visual (AV) representation and iteratively refines the cross-modal attention, producing more discriminative utterance-level embeddings; BLSTMs for temporal modeling and attentive pooling enhance these embeddings further. Evaluated on VoxCeleb1, the approach reports better fusion performance than prior methods, and ablations highlight the benefits of recursion and temporal modeling. The results suggest practical value for multimodal verification systems and potential generalization gains from larger datasets such as VoxCeleb2.
Abstract
Person or identity verification using audio-visual fusion has recently gained considerable attention, as faces and voices are closely associated with each other. Conventional audio-visual approaches rely on score-level or early feature-level fusion. Although these approaches improve over unimodal systems, the potential of audio-visual fusion for person verification has not been fully exploited. In this paper, we investigate how to effectively capture both the intra- and inter-modal relationships across the audio and visual modalities, which can play a crucial role in improving fusion performance over unimodal systems. In particular, we introduce a recursive fusion of a joint cross-attentional model, in which a joint audio-visual feature representation is employed recursively in the cross-attention framework to progressively refine feature representations that efficiently capture the intra- and inter-modal relationships. To further enhance the audio-visual feature representations, we also explore BLSTMs to improve their temporal modeling. Extensive experiments are conducted on the VoxCeleb1 dataset to evaluate the proposed model. Results indicate that the proposed model achieves a promising improvement in fusion performance by capturing the intra- and inter-modal relationships across the audio and visual modalities.
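
To make the recursive fusion concrete, the following is a minimal NumPy sketch of one plausible form of joint cross-attention applied recursively. It is an illustration under stated assumptions, not the authors' implementation: the function names (`init_params`, `attend`, `rjca`), the weight shapes, the tanh/ReLU choices, the residual connection, and the sharing of weights across iterations are all hypothetical. The BLSTM temporal modeling and attentive pooling stages are omitted.

```python
import numpy as np

def init_params(T, d_a, d_v, k, rng):
    """Random weights for the sketch (hypothetical shapes, not from the paper)."""
    def w(*shape):
        return rng.standard_normal(shape) * 0.1
    return {
        "W_ja": w(T, T), "W_jv": w(T, T),      # joint correlation weights
        "W_a": w(d_a, k), "W_v": w(d_v, k),    # project each modality
        "W_ca": w(d_a, k), "W_cv": w(d_v, k),  # project correlation-weighted joint features
        "W_ha": w(k, d_a), "W_hv": w(k, d_v),  # map back to the modality dimension
    }

def attend(X, J, W_j, W_x, W_c, W_h):
    """Attend one modality X (T, d_m) against the joint representation J (T, d)."""
    d = J.shape[1]
    # Correlation between modality features and joint features: (d_m, d).
    C = np.tanh(X.T @ W_j @ J / np.sqrt(d))
    # Combine the modality with the correlation-weighted joint features, then ReLU: (T, k).
    H = np.maximum(0.0, X @ W_x + (J @ C.T) @ W_c)
    # Residual connection keeps the original modality information: (T, d_m).
    return X + H @ W_h

def rjca(X_a, X_v, params, num_iters=2):
    """Recursively refine audio and visual features (weights shared across iterations)."""
    for _ in range(num_iters):
        # The joint representation is rebuilt from the current attended features.
        J = np.concatenate([X_a, X_v], axis=1)
        X_a_new = attend(X_a, J, params["W_ja"], params["W_a"],
                         params["W_ca"], params["W_ha"])
        X_v_new = attend(X_v, J, params["W_jv"], params["W_v"],
                         params["W_cv"], params["W_hv"])
        X_a, X_v = X_a_new, X_v_new
    return X_a, X_v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_a, d_v, k = 5, 4, 3, 6  # toy sizes chosen only for illustration
    params = init_params(T, d_a, d_v, k, rng)
    X_a = rng.standard_normal((T, d_a))
    X_v = rng.standard_normal((T, d_v))
    A, V = rjca(X_a, X_v, params, num_iters=3)
    print(A.shape, V.shape)
```

Because each modality is correlated with the concatenation of both modalities, a single attention step in this sketch sees intra-modal and inter-modal terms together. Each recursion recomputes the joint representation from the already attended features, which is the progressive refinement the abstract describes.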
