Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample

Zhiyong Chen; Zhiqi Ai; Xinnuo Li; Shugong Xu

TL;DR

This paper presents a framework for open-set speaker identification in household environments. It employs task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers, and an enhanced version (SRPL+) that incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy.

Abstract

This paper introduces a novel framework for open-set speaker identification (SID) in household environments, a task that plays a crucial role in facilitating seamless human-computer interaction. To address the limitations of current speaker models and classification approaches, our work integrates a pretrained WavLM frontend with a few-shot rapid-tuning neural network (NN) backend for enrollment, employing task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers. Furthermore, we propose an enhanced version of SRPL (SRPL+), which incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy. Our approach is thoroughly evaluated across various multi-language text-dependent speaker recognition datasets, demonstrating its effectiveness in achieving high usability for complex household multi-speaker recognition scenarios. The proposed system improves open-set performance by up to 27% over direct use of the efficient WavLM Base+ model.

Paper Structure (14 sections, 10 equations, 4 figures, 2 tables)

Introduction
Speaker Reciprocal Points Learning for Open-set Speaker Identification
Rapid Downstream Tuning Approach with Speaker Reciprocal Points Learning (SRPL)
SRPL Enhancement with Negative Samples (SRPL+)
Zero-shot Speech Synthesis Process for Negative Samples Synthesis
Learning Enhancement with Negative Audio Instance
Experiments Settings
Datasets
Metrics
Training and Inference Details
Results
Comparative Evaluation of SRPL with Baselines
Supplementary Analyses for SRPL for Open-set SID
Conclusions

Figures (4)

Figure 1: Illustration of open-set speaker identification architecture: customization via audio large model (LM) with SRPL-based backend rapid tuning.
Figure 2: Conceptual illustration of the embedding space for various open-set training losses. Figure is adapted from chen2021adversarial.
Figure 3: SRPL and its enhanced SRPL+ with integration of negative samples.
Figure 4: t-SNE visualization of speaker embeddings for targets and outliers in testing datasets for SRPL+ and baseline systems.
