AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild

Yongkang Yin, Xu Li, Ying Shan, Yuexian Zou

TL;DR

AFL-Net tackles real-world speaker diarization by fusing audio, face, and dynamic lip movement through a two-step cross-attention mechanism, paired with a training-time masking strategy that increases the model's reliance on audio when visual cues are imperfect. The method extends AVR-Net with explicit modality-availability signaling and a more robust scoring network, achieving lower diarization error rates (DER) than AVR-Net and competitive results against DyViSE, especially when trained with additional large-scale VoxCeleb data and optional WavLM integration. Empirical results show that AFL-Net discriminates identities more effectively and is more robust to missing visual features, supporting the benefits of the lip modality and the proposed fusion strategy. The approach is practically relevant for diarization in unconstrained videos and movies, where poor lighting, occlusion, and off-screen speakers are common. Overall, AFL-Net advances multi-modal diarization by improving robustness and accuracy in the wild through targeted fusion, masking, and clustering choices.

Abstract

Speaker diarization in real-world videos presents significant challenges due to varying acoustic conditions, diverse scenes, off-screen speakers, and other factors. This paper builds upon a previous study (AVR-Net) and introduces a novel multi-modal speaker diarization system, AFL-Net. AFL-Net incorporates dynamic lip movement as an additional modality to improve identity discrimination. In addition, unlike AVR-Net, which extracts high-level representations from each modality independently, AFL-Net employs a two-step cross-attention mechanism to fuse the modalities more thoroughly, yielding more comprehensive information and better performance. Moreover, we incorporate a masking strategy during training in which the face and lip modalities are randomly obscured. This strategy strengthens the influence of the audio modality on the system outputs. Experimental results demonstrate that AFL-Net outperforms state-of-the-art baselines such as AVR-Net and DyViSE.
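
The training-time masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_visual_modalities`, the `mask_prob` parameter, the zero-filling of masked embeddings, the choice to mask face and lip together, and the `visual_available` flag are all assumptions made here for illustration. The abstract states only that the face and lip modalities are randomly obscured during training.

```python
import random
from typing import Dict, List, Optional

Embedding = List[float]


def mask_visual_modalities(
    sample: Dict[str, Embedding],
    mask_prob: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Randomly obscure the face and lip embeddings of one training sample.

    Assumptions (not taken from the paper): masked embeddings are
    zero-filled, face and lip are masked together, and a 0/1 flag
    reports whether the visual modalities were kept.
    """
    rng = rng or random.Random()
    masked = rng.random() < mask_prob  # decide once per sample

    # The audio embedding is always passed through unchanged.
    out: Dict[str, object] = {"audio": list(sample["audio"])}
    for name in ("face", "lip"):
        emb = sample[name]
        out[name] = [0.0] * len(emb) if masked else list(emb)

    # Hypothetical availability signal for the downstream scorer.
    out["visual_available"] = 0 if masked else 1
    return out
```

With `mask_prob=0.0` the sample is returned unchanged, and with `mask_prob=1.0` the face and lip embeddings are always zeroed, so the model sees only audio for that sample. The value used in practice would be a tuned hyperparameter.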

Paper Structure

This paper contains 14 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The proposed system architecture. (a) The AFL-Net architecture, where MLP denotes several linear layers. (b) The multi-modal feature extractor. (c) The cross-attention mechanism.
  • Figure 2: DER comparison on the AVA-AVD dataset under varying missing rates of the visual features.
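
For readers unfamiliar with the metric in Figure 2, the diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below follows that standard definition. The function and parameter names are illustrative, and the scoring settings used in the paper (for example, collar size or overlap handling) are not stated in this summary.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """Return DER as a fraction; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    # Standard definition: all error time over total reference speech time.
    return (missed + false_alarm + confusion) / total_speech
```

For example, 3 s of missed speech, 2 s of false alarm, and 5 s of speaker confusion over 100 s of reference speech give a DER of 0.1, that is, 10%.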