Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

Quan Wang; Yiling Huang; Han Lu; Guanlong Zhao; Ignacio Lopez Moreno

TL;DR

This work tackles real-time, on-device speaker diarization under tight CPU, memory, and power budgets across inputs of varying lengths. It introduces a multi-stage clustering framework that combines agglomerative hierarchical clustering (AHC) and spectral clustering with dynamic compression, using the upper bounds $U_1$ and $U_2$ to cap time and memory cost while maintaining accuracy. Short-form inputs are handled by an AHC fallback, medium-length inputs are handled by spectral clustering, which also estimates the number of speakers, and long-form inputs are compressed by a pre-clusterer before final clustering, with a caching mechanism that keeps the cost bounded. On-device CPU benchmarks (e.g., on a Pixel 4) and diarization error rate (DER) analyses on multiple datasets demonstrate practical viability and tunable trade-offs between clustering quality and resource usage, enabling robust streaming diarization on mobile devices.

Abstract

While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we demonstrate that a multi-stage clustering strategy that uses different clustering algorithms for input of different lengths can address multi-faceted challenges of on-device speaker diarization applications. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different resource constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
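The length-based dispatch described in the abstract can be illustrated with a minimal sketch. It assumes the main clusterer handles inputs whose count lies between a lower bound `lower` and an upper bound `upper1` (named $L$ and $U_1$ in the figure captions), with inclusive boundaries chosen here for illustration. The function names are hypothetical, and `toy_cluster` and `toy_compress` are dependency-free placeholders standing in for the real clusterers, not the algorithms used in the paper. The pre-clusterer bound $U_2$ and the caching mechanism are not modeled.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Clusterer = Callable[[List[Vector]], List[int]]
Compressor = Callable[[List[Vector], int], Tuple[List[Vector], List[int]]]


def select_stage(num_inputs: int, lower: int, upper1: int) -> str:
    """Choose the clustering path from the number of inputs (boundaries assumed inclusive)."""
    if num_inputs < lower:
        return "fallback"   # short-form input
    if num_inputs <= upper1:
        return "main"       # medium-length input
    return "pre+main"       # long-form input: compress first, then cluster


def multi_stage_cluster(inputs: List[Vector], lower: int, upper1: int,
                        fallback: Clusterer, main: Clusterer,
                        compress: Compressor) -> List[int]:
    """Return one speaker label per input, using the path chosen by select_stage."""
    stage = select_stage(len(inputs), lower, upper1)
    if stage == "fallback":
        return fallback(inputs)
    if stage == "main":
        return main(inputs)
    # Long-form: reduce to at most `upper1` centroids, cluster the centroids,
    # then propagate each centroid's label back to its member inputs.
    centroids, assignment = compress(inputs, upper1)
    centroid_labels = main(centroids)
    return [centroid_labels[a] for a in assignment]


def toy_cluster(inputs: List[Vector]) -> List[int]:
    """Placeholder clusterer: label by the sign of the first coordinate."""
    return [0 if x[0] < 0 else 1 for x in inputs]


def toy_compress(inputs: List[Vector], max_size: int) -> Tuple[List[Vector], List[int]]:
    """Placeholder pre-clusterer: average consecutive chunks into <= max_size centroids."""
    chunk = -(-len(inputs) // max_size)  # ceiling division
    assignment = [i // chunk for i in range(len(inputs))]
    dim = len(inputs[0])
    centroids = []
    for c in range(assignment[-1] + 1):
        members = [x for x, a in zip(inputs, assignment) if a == c]
        centroids.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return centroids, assignment


embeddings = [[-1.0], [-1.2], [-0.9], [-1.1], [1.0], [1.1], [0.9], [1.2]]
labels = multi_stage_cluster(embeddings, lower=2, upper1=4,
                             fallback=toy_cluster, main=toy_cluster,
                             compress=toy_compress)
print(labels)
```

Passing the clusterers in as callables keeps the dispatch logic separate from the choice of algorithm, so a real AHC or spectral clustering implementation could be swapped in without changing `multi_stage_cluster`.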

Paper Structure (27 sections, 3 figures, 6 tables)

Introduction
Baseline system
Feature frontend
Speaker turn detection
Model architecture
Training data
Speaker encoder
Model architecture
Training data
Constraints of the baseline system
Multi-stage clustering
Speaker turn based decision
Fallback clusterer
Main clusterer
Pre-clusterer
...and 12 more sections

Figures (3)

Figure 1: Architecture of the speaker encoder model.
Figure 2: Diagram of the multi-stage clustering strategy. $L$ and $U_1$ are the lower bound and upper bound of the main clusterer, respectively. $U_2$ is the upper bound of the pre-clusterer.
Figure 3: Plot of the multi-stage clustering runtime cost on a Pixel 4 mobile device. $L$ and $U_1$ are the lower bound and upper bound of the main clusterer, respectively; $U_2$ is the upper bound of the pre-clusterer. Roughly speaking, 100 inputs correspond to over 6 minutes, and 300 inputs correspond to approximately 20 minutes.
