Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang

TL;DR

The paper tackles speaker diarization under challenging acoustic conditions by exploiting semantic information from transcripts. It introduces Joint Pairwise Constraints Propagation (JPCP), which injects speaker-related semantic cues into clustering as must-link and cannot-link constraints, applied in both embedding normalization and affinity refinement via constraint propagation. The approach combines SSDR-based constrained embedding normalization with a refined affinity function that uses enhanced constraint propagation (E^2CPM) to spread sparse semantic constraints. Experiments on AISHELL-4 show that semantic constraints yield consistent improvements over acoustic-only baselines, with reductions in Text Diarization Error Rate (TextDER) and improved speaker count accuracy; experiments with simulated constraints indicate the upper-bound potential. The framework is modular and compatible with existing speaker diarization pipelines, suggesting practical value for robust diarization of real-world meetings as language models and ASR improve.

Abstract

Speaker diarization has gained considerable attention within the speech processing research community. Mainstream speaker diarization systems rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Because speech signals also convey the content of what is said, we aim to fully exploit these semantic cues using language models. In this work, we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. First, we introduce spoken language understanding modules to extract speaker-related semantic information and use this information to construct pairwise constraints. Second, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on a public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
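
To make the idea of pairwise constraints in a clustering backend concrete, the following is a minimal sketch, not the paper's implementation. It assumes a generic closed-form constraint propagation over a symmetric affinity matrix with entries in [0, 1]; the paper's E^2CPM may differ in its details. The function names, the toy affinity matrix, and the value `alpha=0.8` are all illustrative assumptions.

```python
import numpy as np

def build_constraint_matrix(n, must_link, cannot_link):
    """Encode sparse pairwise constraints: +1 must-link, -1 cannot-link, 0 unknown."""
    Z = np.zeros((n, n))
    for i, j in must_link:
        Z[i, j] = Z[j, i] = 1.0
    for i, j in cannot_link:
        Z[i, j] = Z[j, i] = -1.0
    return Z

def propagate_constraints(W, Z, alpha=0.8):
    """Spread the sparse constraints in Z along the affinity graph W.

    Uses the closed form F = (1 - alpha)^2 * M @ Z @ M with
    M = (I - alpha * L)^-1 and L = D^-1/2 W D^-1/2 (an assumed, generic scheme).
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    M = np.linalg.inv(np.eye(W.shape[0]) - alpha * L)
    return (1.0 - alpha) ** 2 * (M @ Z @ M)

def refine_affinity(W, F):
    """Raise affinities where the propagated score is positive, lower them where negative."""
    F = np.clip(F, -1.0, 1.0)
    return np.where(F >= 0, 1.0 - (1.0 - F) * (1.0 - W), (1.0 + F) * W)

# Toy acoustic affinity between four speech segments (illustrative values only).
W = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
# Hypothetical semantic cues: segments 0 and 2 share a speaker, 0 and 1 do not.
Z = build_constraint_matrix(4, must_link=[(0, 2)], cannot_link=[(0, 1)])
W_refined = refine_affinity(W, propagate_constraints(W, Z))
print(np.round(W_refined, 3))
```

The refined matrix can then be passed to whatever clustering step the pipeline already uses, for example spectral clustering, in place of the purely acoustic affinity. In this sketch, when only must-link constraints are supplied the propagated scores are non-negative, so affinities can only rise; with only cannot-link constraints they can only fall.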

Paper Structure (15 sections, 6 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: A sample strategy for constructing constraints.
  • Figure 2: The pipeline is a traditional speaker diarization backend with acoustic information. The additional pairwise constraints constructed from semantic information, including Must-Link and Cannot-Link, are used in two parts: Embedding Normalization and Affinity Function.
  • Figure 3: The impact of the pairwise constraint rate on both clustering metrics and the effectiveness of the overall speaker diarization system.
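
A study of constraint rate like the one in Figure 3 needs constraints drawn at a controlled rate. The sketch below shows one plausible way to simulate them from reference speaker labels. It is an assumption for illustration, not the paper's protocol: the function name `sample_constraints` is hypothetical, and "rate" is taken here to mean the fraction of all segment pairs that receive a constraint.

```python
import numpy as np

def sample_constraints(labels, rate, seed=0):
    """Sample a fraction `rate` of all segment pairs and label each pair
    must-link (same reference speaker) or cannot-link (different speakers)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    labels = np.asarray(labels)
    # Enumerate every unordered pair (i, j) with i < j.
    iu, ju = np.triu_indices(len(labels), k=1)
    n_pairs = len(iu)
    k = int(round(rate * n_pairs))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(n_pairs, size=k, replace=False)
    must_link, cannot_link = [], []
    for idx in chosen:
        i, j = int(iu[idx]), int(ju[idx])
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

# Five segments from three reference speakers (toy labels).
reference = [0, 0, 1, 1, 2]
must_link, cannot_link = sample_constraints(reference, rate=0.5, seed=1)
print(must_link, cannot_link)
```

Sweeping `rate` from 0 to 1 and re-running clustering at each value gives the kind of curve the figure describes. Because these constraints come from reference labels, they are error-free and so indicate an upper bound rather than what a real semantic module would deliver.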