RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu

TL;DR

The paper tackles the problem of robustly separating multiple overlapped speakers in audio-visual data when visual cues may be incomplete. It introduces a unified, simultaneous multi-speaker AVSS framework that integrates intra-chunk and inter-chunk speech modeling with speaker-wise interactions, including GAV, GCA, SAI, and SAVI modules to leverage available visuals and compensate for missing cues. The model is trained with a split approach that treats visually-guided and non-visually-guided outputs differently, using PIT-SI-SDR and SI-SDR losses, respectively. Experiments on VoxCeleb2 and LRS3 demonstrate state-of-the-art performance for 2–5 speaker mixtures and show robust performance under missing-visual scenarios, with improved downstream AV-HuBERT WER, highlighting practical impact for real-world multi-speaker communication tasks.
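The two loss names above can be illustrated with a minimal NumPy sketch of the standard SI-SDR definition and a permutation-invariant (PIT) wrapper around it. This is not the authors' training code: the function names `si_sdr` and `pit_si_sdr` are hypothetical, the signals are assumed to be 1-D arrays of equal length, and how each loss is assigned to the model's outputs is left out because it is not detailed here.

```python
import itertools
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (standard definition)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

def pit_si_sdr(estimates, targets):
    """Best mean SI-SDR over all output-to-target assignments, with that assignment."""
    n = len(targets)
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        # perm[i] is the index of the target paired with output i.
        score = np.mean([si_sdr(estimates[i], targets[p]) for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

A training loss would typically be the negative of these scores. The exhaustive search visits N! permutations, which stays cheap for the 2 to 5 speakers considered here (at most 120 assignments).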

Abstract

While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they demonstrate a severe performance drop in multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that separates multiple speakers concurrently within a single process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers. Additionally, our model can utilize speakers with complete audio-visual information to compensate for other speakers whose visual cues are deficient, thereby enhancing its resilience to missing visual cues. We also conduct experiments where visual information for specific speakers is entirely absent or visual frames are partially missing. The results demonstrate that our model consistently outperforms others, exhibiting the smallest performance drop across all settings involving 2, 3, 4, and 5 speakers.

Paper Structure (15 sections, 12 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: AVSS task description and the contribution of our work. (a) shows the basic audio-visual speech separation process. It uses the visual cue to extract the corresponding speech from the mixture. The separation process is repeated to separate more speakers. (b) demonstrates our proposed separation process, which addresses the task of separating multiple speakers jointly. It can simultaneously separate multiple speech sources using multiple visual cues, maintaining robustness to missing visual cues.
  • Figure 2: The overall pipeline of our model. In this figure, we assume that $N=3$, $P=2$, $L=4$ and $S=3$. The separation block comprises the LCA, GAV, SAI, and SAVI modules. LCA enables effective intra-chunk processing. GCA facilitates inter-chunk processing and enhances inter-chunk disparity using cross-modal visual features. SAI establishes distinctions and correlations between different separated speakers. SAVI further enhances the disparity between hard samples that are similar to each other. Additionally, in scenarios where the available visual information is insufficient for the number of speech signals to be separated, we employ a split-and-concat operation to achieve cross-modal interaction.
  • Figure 3: The visual robustness performance with visual absence. We compare the performance of our model with two other methods on 2-, 3-, 4-, and 5-speaker mixtures. We evaluate the performance of each model under two conditions: when all visual cues are present and when one visual cue is missing. The figure shows each model's drop rate in performance when one visual cue is missing. We do not include scenarios where more than one visual cue is missing, as they mostly result in negative values.
  • Figure 4: The visual robustness performance with different missing video frame rates. Note that we apply the missing frame rate to the video frames of all corresponding speakers. The x-axis shows the percentage of missing frames in each corresponding video, and the y-axis shows the resulting SI-SDR.
  • Figure 5: Visualization results for the separation of a 5-speaker speech mixture. Each spectrogram shows time on the horizontal axis and frequency on the vertical axis. The red box indicates time differences within the same row, while the black box highlights frequency differences.
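
The robustness settings described for Figures 3 and 4 can be sketched roughly as follows. Several details are assumptions, since the captions do not specify them: missing frames are chosen uniformly at random and replaced with zeros, and the "drop rate" is taken to be the relative decrease from the full-visual score. The names `drop_frames` and `relative_drop` are hypothetical and do not come from the paper.

```python
import numpy as np

def drop_frames(frames, missing_rate, rng):
    """Zero out a random fraction of video frames along the time axis (axis 0).

    Assumption: missing frames are picked uniformly at random and filled with
    zeros; the captions above do not state how frames are actually removed.
    """
    num_frames = frames.shape[0]
    num_missing = int(round(missing_rate * num_frames))
    missing_idx = rng.choice(num_frames, size=num_missing, replace=False)
    masked = frames.copy()
    masked[missing_idx] = 0.0
    # Boolean mask over time: True where the frame is still available.
    keep_mask = np.ones(num_frames, dtype=bool)
    keep_mask[missing_idx] = False
    return masked, keep_mask

def relative_drop(full_score, degraded_score):
    """Assumed drop-rate definition: fraction of the full-visual score that is lost."""
    return (full_score - degraded_score) / full_score
```

For example, `drop_frames(video, 0.4, np.random.default_rng(0))` on a 25-frame clip would blank 10 frames and return a mask marking the 15 that remain. Under this definition the drop rate becomes hard to interpret once a score turns negative, which may be related to the caption's remark about negative values.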