Online speaker diarization of meetings guided by speech separation

Elio Gruttadauria, Mathieu Fontaine, Slim Essid

TL;DR

This paper tackles overlapped speech in online speaker diarization by extending the speech separation guided diarization (SSGD) framework to multi-speaker meetings. It combines a sliding-window speech separation (SSep) model (ConvTasNet or DPRNN) with per-source voice activity detection (VAD) and a stitching mechanism that uses speaker embeddings and incremental clustering to link local predictions across time, enabling online diarization without oracle information. End-to-end fine-tuning on real AMI data, including joint adaptation of the separation model and the VAD, yields a state-of-the-art diarization error rate (DER) on the AMI headset mix under full evaluation (no collar, overlapped speech included), with particular strength on overlapped speech regions. The approach is robust across SSep architectures and output configurations (two or three output sources), and demonstrates practical viability for real-time meeting diarization with variable numbers of speakers.

Abstract

Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.
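The stitching step described in the abstract (linking per-window sources to meeting-level speakers via speaker embeddings and incremental clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy cosine-similarity matching, the running-sum centroids, the `threshold` value, and the names `IncrementalClusterer` and `assign` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


class IncrementalClusterer:
    """Hypothetical online clusterer: one running embedding sum per global speaker."""

    def __init__(self, threshold=0.5):
        # Assumed similarity threshold below which a new speaker is created.
        self.threshold = threshold
        self.sums = []  # running sum of embeddings for each global speaker

    def assign(self, embeddings):
        """Map the source embeddings of one window to global speaker ids.

        Each global speaker is used at most once per window, because the
        separated sources of a window are assumed to be distinct speakers.
        """
        labels = []
        used = set()
        for emb in embeddings:
            best_id, best_sim = None, self.threshold
            for spk_id, total in enumerate(self.sums):
                if spk_id in used:
                    continue
                sim = cosine(emb, total)  # cosine is scale-invariant, so sum == centroid
                if sim > best_sim:
                    best_id, best_sim = spk_id, sim
            if best_id is None:
                # No existing speaker is similar enough: open a new cluster.
                self.sums.append(list(emb))
                best_id = len(self.sums) - 1
            else:
                # Update the matched speaker's running sum with the new embedding.
                self.sums[best_id] = [s + e for s, e in zip(self.sums[best_id], emb)]
            used.add(best_id)
            labels.append(best_id)
        return labels
```

Calling `assign` once per window, in temporal order, yields a consistent speaker id for each local source without looking at future audio, which is what makes the scheme usable online. A real system would likely skip sources that the VAD marks as silent before computing embeddings.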

Paper Structure (6 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Diagram of the inference process for local predictions on 5-s windows. The dataset used for the end-to-end fine-tuning is symbolized with chains.
  • Figure 2: Diagram of a single step of the stitching of local predictions.
  • Figure 3: All proposed online systems compared to Coria et al.'s system, tested at minimum latency (0.5 s, left bar) and maximum latency (5 s, right bar). The DER is broken down into its constituents: Missed Speech (MS), False Alarm (FA) and Speaker Confusion (SC).
  • Figure 4: Performance of ConvTasNet5 trained on Libri5Mix on mixtures with 5, 4, 3 and 2 speakers. The red crosses show the performance of the SSep model with as many outputs as the speakers in the mixtures.
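
The DER breakdown named in the Figure 3 caption follows the standard definition: the three error durations are summed and divided by the total reference speech time. A minimal sketch, where the function name and the example durations are hypothetical and not taken from the paper:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (MS + FA + SC) / total reference speech time.

    All arguments are durations in seconds. With overlapped speech, the
    durations are counted per speaker, so DER can exceed 1.0.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, for illustration only.
example_der = diarization_error_rate(
    missed=3.0, false_alarm=2.0, confusion=5.0, total_speech=100.0
)
```

Under the "full evaluation" setting mentioned in the abstract, no forgiveness collar is applied around reference boundaries and overlapped regions are scored, so all three terms are measured over the whole recording.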