SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
Yue Li, Xinsheng Wang, Li Zhang, Lei Xie
TL;DR
This work addresses speaker change detection (SCD) by using self-supervised learning (SSL) features within an end-to-end Conformer-based framework, SCDNet. It introduces a learnable layer-weighting fusion to identify the most informative SSL layer, and adds a contrastive loss to mitigate overfitting in frame-level SCD. In experiments on multiple real and artificial datasets, WavLM-based SSL representations, especially those from intermediate layers, consistently yield strong performance; SCDNet achieves state-of-the-art results, and the contrastive approach further improves the gains from fine-tuning. The findings suggest that SSL features, coupled with layer-aware fusion and contrastive learning, offer robust, scalable improvements for SCD, with potential downstream benefits for ASR and captioning systems.
Abstract
Speaker change detection (SCD) aims to identify the boundaries between speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for SCD, this work further investigates self-supervised learning (SSL) features for the task. Specifically, an SCD model named SCDNet is proposed and used to evaluate several state-of-the-art SSL models, including HuBERT, wav2vec 2.0, and WavLM. To identify the most effective layers of the SSL models for SCD, a learnable weighting method is employed to analyze the intermediate representations. A fine-tuning-based approach is also implemented to further compare the characteristics of the SSL models on the SCD task. Furthermore, a contrastive learning method is proposed to mitigate overfitting when training both the fine-tuning-based method and SCDNet. Experiments show the superiority of WavLM on the SCD task and the effectiveness of the SCDNet design.
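The learnable weighting over SSL layers mentioned above can be illustrated with a small sketch. The abstract does not specify how the weights are normalized; the softmax over one scalar per layer used here is an assumption, borrowed from common practice with SSL features, and the names `fuse_layers` and `layer_scores` are hypothetical. Only the forward pass is shown, in plain Python, so that the arithmetic is explicit; in a real model the scores would be trainable parameters updated together with the SCD network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, layer_scores):
    """Fuse per-layer SSL features with softmax-normalized layer weights.

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats.
    layer_scores:   list of L unnormalized scalars (assumed to be the
                    learnable parameters; softmax is an assumption).
    Returns (fused, weights): fused is T frames of D floats,
    weights is the list of L normalized layer weights.
    """
    weights = softmax(layer_scores)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return fused, weights


# Toy input: 2 layers, 1 frame, 2 feature dimensions.
toy_features = [[[1.0, 0.0]], [[0.0, 1.0]]]
fused, weights = fuse_layers(toy_features, [0.0, 0.0])
```

Under this formulation, inspecting `weights` after training is one way to see which layers the model relies on: a layer whose normalized weight is much larger than the others contributes most to the fused representation.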
