Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation
Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang
TL;DR
The paper tackles speaker diarization under challenging acoustic conditions by exploiting semantic information from transcripts. It introduces Joint Pairwise Constraints Propagation (JPCP), which injects speaker-related semantic cues into clustering as must-link and cannot-link constraints. These constraints are applied at two points: embedding normalization and affinity refinement via constraint propagation. Concretely, the approach combines SSDR-based constrained embedding normalization with a refined affinity function that uses enhanced constraint propagation (E^2CPM) to spread sparse semantic constraints across segment pairs. Experiments on AISHELL-4 show that semantic constraints yield consistent improvements over acoustic-only baselines, with notable reductions in Text Diarization Error Rate (TextDER) and improved speaker count accuracy; experiments with simulated constraints indicate the upper-bound potential. The framework is modular and compatible with existing speaker diarization pipelines, suggesting practical impact for robust diarization in real-world meetings as language models and ASR improve.
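To make the idea of propagating sparse must-link and cannot-link constraints into an affinity matrix more concrete, here is a minimal sketch. It assumes a generic constraint-propagation scheme in the style of exhaustive-and-efficient constraint propagation, which may differ from the paper's E^2CPM in its details. The function names, the `alpha` parameter, and the toy affinity values are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_constraint_matrix(n, must_link, cannot_link):
    """Encode sparse pairwise constraints: +1 must-link, -1 cannot-link, 0 unknown."""
    Z = np.zeros((n, n))
    for i, j in must_link:
        Z[i, j] = Z[j, i] = 1.0
    for i, j in cannot_link:
        Z[i, j] = Z[j, i] = -1.0
    return Z

def propagate_constraints(W, Z, alpha=0.2):
    """Spread constraints along the affinity graph (assumed generic scheme).

    Computes F = P Z P^T with P = (1 - alpha) * (I - alpha * S)^-1,
    where S is the symmetrically normalized affinity matrix.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    n = W.shape[0]
    P = (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)
    F = P @ Z @ P.T
    # Clip for numerical safety so the refinement below stays in [0, 1].
    return np.clip(F, -1.0, 1.0)

def refine_affinity(W, F):
    """Raise affinities where F >= 0 and shrink them where F < 0."""
    return np.where(F >= 0, 1.0 - (1.0 - F) * (1.0 - W), (1.0 + F) * W)

# Toy example: 4 speech segments with acoustic affinities in [0, 1].
W = np.array([
    [1.0, 0.6, 0.5, 0.2],
    [0.6, 1.0, 0.4, 0.3],
    [0.5, 0.4, 1.0, 0.3],
    [0.2, 0.3, 0.3, 1.0],
])
# Hypothetical semantic cues: segments 0 and 1 share a speaker, 0 and 2 do not.
Z = build_constraint_matrix(4, must_link=[(0, 1)], cannot_link=[(0, 2)])
F = propagate_constraints(W, Z, alpha=0.2)
W_refined = refine_affinity(W, F)
```

The refined matrix `W_refined` could then be passed to any affinity-based clustering step (for example spectral clustering) in place of the purely acoustic affinities. A small `alpha` keeps the propagated constraints close to the original sparse ones; larger values spread them further along the graph.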
Abstract
Speaker diarization has gained considerable attention within the speech processing research community. Mainstream speaker diarization systems rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Given that speech signals also convey the content of what is spoken, we aim to fully exploit these semantic cues using language models. In this work, we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. First, we introduce spoken language understanding modules to extract speaker-related semantic information and use this information to construct pairwise constraints. Second, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on a public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
