Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie
TL;DR
The paper tackles robust overlapping speech detection (OSD) in multi-party conversations, where speaker and data variability make accurate overlap localization difficult. It introduces a speaker-aware progressive OSD architecture that combines a pretrained WavLM SSL encoder, a frame-level CampPlus speaker attention module, a temporal masking stage guided by VAD, and separate VAD and OSD decoders trained progressively in two stages. The model is pretrained on LibriHeavyMix and fine-tuned on real data (AliMeeting and AMI), with a fuzzy labeling scheme to handle uncertain annotation boundaries. On the AMI test set the method reaches a new state-of-the-art F1 of 82.76%. Ablations indicate that both the speaker attention module and the progressive modeling improve recall and reduce false alarms, which makes the approach relevant to downstream diarization and ASR.
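To make the idea of VAD-guided temporal masking concrete, the following is a minimal sketch, not the authors' implementation. It assumes that masking means zeroing out feature frames whose predicted speech probability falls below a threshold before they reach the OSD decoder; the function name `vad_guided_mask`, the hard threshold of 0.5, and the list-based feature layout are all hypothetical choices made for illustration.

```python
def vad_guided_mask(features, vad_probs, threshold=0.5):
    """Zero out frames that a VAD stage considers non-speech.

    features:  list of per-frame feature vectors (lists of floats)
    vad_probs: list of per-frame speech probabilities in [0, 1]
    threshold: hypothetical cut-off; the paper's actual masking rule
               is not given in this summary
    """
    if len(features) != len(vad_probs):
        raise ValueError("features and vad_probs must have the same number of frames")
    masked = []
    for frame, prob in zip(features, vad_probs):
        if prob >= threshold:
            masked.append(list(frame))          # keep speech frames unchanged
        else:
            masked.append([0.0] * len(frame))   # suppress non-speech frames
    return masked


# Three frames of 2-dimensional features; the middle frame is judged non-speech.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vad_probs = [0.9, 0.2, 0.5]
masked = vad_guided_mask(features, vad_probs)
```

The intuition behind such a stage is that overlap can only occur where speech is present, so suppressing non-speech frames narrows what the OSD decoder has to discriminate.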
Abstract
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks such as voice activity detection (VAD) and overlap detection. To improve acoustic representation, we explore the effectiveness of state-of-the-art self-supervised learning (SSL) models, including WavLM and wav2vec 2.0, while incorporating a speaker attention module to enrich features with frame-level speaker information. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set, demonstrating its robustness and effectiveness in OSD.
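For readers unfamiliar with the reported metric, here is a minimal sketch of how precision, recall, and F1 can be computed from binary frame-level overlap labels. It assumes a plain frame-by-frame comparison with no collar or tolerance, which may differ from the exact scoring protocol used in the paper; the function name `frame_level_f1` is hypothetical.

```python
def frame_level_f1(reference, hypothesis):
    """Return (precision, recall, f1) for binary per-frame overlap labels.

    reference:  list of 0/1 ground-truth labels (1 = overlapped speech)
    hypothesis: list of 0/1 predicted labels, same length as reference
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)  # correctly detected overlap
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)  # false alarms
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)  # missed overlap
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 2 true positives, 1 false alarm, 1 miss.
reference = [0, 1, 1, 1, 0, 0]
hypothesis = [0, 1, 1, 0, 1, 0]
precision, recall, f1 = frame_level_f1(reference, hypothesis)
```

Because F1 is the harmonic mean of precision and recall, it rises only when a model reduces false alarms and misses together.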
