System Description for the Displace Speaker Diarization Challenge 2023

Ali Aliyev

TL;DR

This work addresses speaker diarization in the multilingual setting of Displace 2023, where speakers alternate between English and Hindi. The authors present a three-stage system: a VAD module, a ResNet-based embedding extractor trained on multilingual data, and spectral clustering for grouping segments. Overlapped speech is handled via subsegment embeddings. The extractor is trained on VoxCeleb2 combined with Common Voice Russian, using 2-s input windows, 80-dimensional Mel features, and AAM-Softmax loss. The system achieves DERs of 27.1% and 27.4% on the development and phase-1 evaluation data, respectively, and the results point to the VAD component as a key bottleneck. The study also finds that spectral clustering outperforms hierarchical clustering in this setting and suggests that the pipeline is practical for multilingual diarization tasks.

Abstract

This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023). We combine voice activity detection (VAD) to find speech segments, a ResNet-based CNN to extract features from these segments, and spectral clustering to cluster the features. Although it was not trained on Hindi data, the described system achieves DERs of 27.1% and 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.
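
To make the final clustering stage concrete, here is a minimal sketch of spectral clustering applied to per-segment speaker embeddings. It is an illustration under stated assumptions and not the authors' implementation: the cosine-based affinity, the fixed and known number of speakers, the small k-means helper, and the toy 4-dimensional vectors standing in for ResNet embeddings are all hypothetical choices that the text above does not specify.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Pairwise cosine similarity, rescaled from [-1, 1] to [0, 1] (assumed affinity)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return (unit @ unit.T + 1.0) / 2.0


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_cluster(embeddings, n_speakers):
    """Assign a speaker label to each segment embedding (n_speakers assumed known)."""
    affinity = cosine_affinity(embeddings)
    np.fill_diagonal(affinity, 0.0)  # drop self-similarity
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    # Symmetric normalised Laplacian: L = I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    spectral = eigvecs[:, :n_speakers]      # eigenvectors of the smallest eigenvalues
    spectral = spectral / np.linalg.norm(spectral, axis=1, keepdims=True)
    return kmeans(spectral, n_speakers)


# Toy data standing in for per-segment embeddings of two speakers.
rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=[1.0, 0.0, 0.0, 0.0], scale=0.05, size=(5, 4))
speaker_b = rng.normal(loc=[0.0, 1.0, 0.0, 0.0], scale=0.05, size=(5, 4))
labels = spectral_cluster(np.vstack([speaker_a, speaker_b]), n_speakers=2)
print(labels)
```

In a real diarization system the number of speakers is usually unknown and would have to be estimated, for example from gaps in the Laplacian eigenvalues; the sketch sidesteps that by passing `n_speakers` directly.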
