Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Luyao Cheng; Hui Wang; Siqi Zheng; Yafeng Chen; Rongjie Huang; Qinglin Zhang; Qian Chen; Xihao Li

TL;DR

This work tackles speaker diarization in complex conversations by integrating audio, visual, and semantic cues within a constrained optimization framework. It introduces a joint pairwise constraint propagation approach that fuses must-link and cannot-link signals from visual and semantic sources with audio-derived embeddings to refine speaker affinity prior to clustering. The method yields significant improvements across DER, JER, NMI, ARI, TextDER, and CpWER on diverse multimodal datasets, demonstrating robust generalization beyond single- or bi-modal systems. By explicitly modeling cross-modal constraints and propagating them through an affinity graph, the framework provides a scalable, principled path toward more accurate and reliable multimodal diarization.

Abstract

Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogeneous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals. Recent studies have made tremendous efforts towards audio-visual or audio-semantic modeling to enhance performance. However, even the incorporation of up to two modalities often falls short in addressing the complexities of spontaneous and unstructured conversations. To exploit more meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our method formulates the multimodal modeling as a constrained optimization problem. First, we build insights into the visual connections among active speakers and the semantic interactions within spoken content, thereby establishing abundant pairwise constraints. Then, we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints. This integration effectively leverages the complementary strengths of different modalities, refining the affinity estimation between individual speaker embeddings. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach consistently outperforms state-of-the-art speaker diarization methods.
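
To make the idea of propagating pairwise constraints through an affinity graph more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a closed form commonly used in the constraint propagation literature (often referred to as E2CP); the paper's exact formulation may differ. The function names, the parameter `lam`, and the affinity refinement rule are illustrative assumptions.

```python
import numpy as np


def propagate_constraints(W, Z, lam=0.8):
    """Spread sparse pairwise constraints Z over the affinity graph W.

    W   : (n, n) symmetric audio affinity with entries in [0, 1] (assumed input).
    Z   : (n, n) symmetric initial constraints: +1 must-link, -1 cannot-link, 0 unknown.
    lam : propagation strength in (0, 1). Values near 0 keep F close to Z;
          values near 1 smooth the constraints more strongly over the graph.
    """
    n = W.shape[0]
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetrically normalised affinity: D^{-1/2} W D^{-1/2}
    L_bar = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Closed-form propagation applied along rows and columns of Z
    P = np.linalg.inv(np.eye(n) - lam * L_bar)
    F = (1.0 - lam) ** 2 * P @ Z @ P
    # Clip defensively so the refinement below stays inside [0, 1]
    return np.clip(F, -1.0, 1.0)


def refine_affinity(W, F):
    """Raise affinities where F is positive and shrink them where F is negative."""
    return np.where(F >= 0, 1.0 - (1.0 - F) * (1.0 - W), (1.0 + F) * W)
```

Under these assumptions, the refined matrix returned by `refine_affinity` would be handed to whatever clustering step the audio pipeline already uses, for example a spectral clustering routine that accepts a precomputed affinity.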

Paper Structure (25 sections, 6 equations, 5 figures, 5 tables)

This paper contains 25 sections, 6 equations, 5 figures, 5 tables.

Introduction
Related Work
Audio-only Speaker diarization
Audio-visual Speaker diarization
Audio-textual Speaker diarization
Pairwise Constrained Clustering
Methods
Joint Pairwise Constraint Propagation with multimodal Information
Visual constraints construction
Semantic constraints construction
Experiments
Datasets
Implementation Details
Evaluation Metrics
Results and Discussion
...and 10 more sections

Figures (5)

Figure 1: An overview of our proposed multimodal speaker diarization system. It incorporates additional visual and textual processing modules that independently extract visual and semantic constraints. By integrating and propagating knowledge derived from these different insights, comprehensive multimodal pairwise constraints are generated, serving as a robust guidance for enhancing the audio-based diarization.
Figure 2: Semantic constraint construction based on dialogue detection and speaker-turn detection. Text segments judged as non-dialogue indicate that the associated embeddings are related through must-link constraints, depicted by solid connections below. Conversely, a detected transition point dictates that embeddings spanning this point should be connected with cannot-link constraints, as represented by dashed connections above.
Figure 3: Constrained speaker clustering performance across various levels of constraint coverage, including scenarios with imbalanced proportions of must-link and cannot-link constraints.
Figure 4: Simulated constraints containing errors and their effect on constrained clustering.
Figure 5: Analysis of constrained clustering outcomes with varying $\lambda$ values. It is observed that when constructed constraints contain errors, the peak of the optimal $\lambda$ shifts towards 1.0.
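
The semantic constraint construction described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes that speaker embeddings are ordered in time, that non-dialogue text segments have already been mapped to index ranges over those embeddings, and that each detected speaker turn is given as the index of the first embedding after the transition. The function name and argument layout are hypothetical.

```python
import numpy as np


def build_semantic_constraints(n, non_dialogue_spans, turn_points):
    """Build an (n, n) constraint matrix: +1 must-link, -1 cannot-link, 0 unknown.

    n                  : number of time-ordered speaker embeddings.
    non_dialogue_spans : list of (start, end) index ranges, end exclusive, whose
                         text was judged non-dialogue (assumed single speaker).
    turn_points        : list of indices t where a speaker change was detected
                         between embedding t - 1 and embedding t.
    """
    Z = np.zeros((n, n))
    # Embeddings inside one non-dialogue span are linked to each other.
    for start, end in non_dialogue_spans:
        Z[start:end, start:end] = 1.0
    # Embeddings on either side of a transition point must be separated.
    # Applied last, so a detected turn overrides a conflicting must-link.
    for t in turn_points:
        if 0 < t < n:
            Z[t - 1, t] = Z[t, t - 1] = -1.0
    # An embedding carries no constraint with itself.
    np.fill_diagonal(Z, 0.0)
    return Z
```

Letting cannot-link entries override must-link entries is a design choice made for this sketch only; a real system might instead drop conflicting pairs or weight them by detector confidence.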
