A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition

Zhenyu Zhou; Shibiao Xu; Shi Yin; Lantian Li; Dong Wang

TL;DR

The study examines how speaker augmentation via speed perturbation (SP) and vocal tract length perturbation (VTLP) can enlarge speaker recognition training data by generating new pseudo-speakers, thereby enriching the embedding space. It conducts a controlled comparative analysis on the public VoxCeleb1 and CN-Celeb1 datasets using a ResNet34-SE/x-vector pipeline, evaluating the perturbation factor $\alpha$ and deviation measures to diagnose when augmented data remains intelligible while introducing meaningful speaker variation. It finds that both SP and VTLP improve recognition, with SP typically offering stronger gains; their effects differ with data complexity and the chosen $\alpha$, and combining them can yield additional improvements on VoxCeleb with careful fusion design. The work emphasizes the potential of speaker augmentation to expand speaker diversity and suggests future directions, including optimal fusion strategies and other speech morphing techniques.

Abstract

Data augmentation (DA) has played a pivotal role in the success of deep speaker recognition. Current DA techniques primarily focus on speaker-preserving augmentation, which does not change the speaker trait of the speech and does not create new speakers. Recent research has shed light on the potential of speaker augmentation, which generates new speakers to enrich the training dataset. In this study, we delve into two speaker augmentation approaches: speed perturbation (SP) and vocal tract length perturbation (VTLP). Despite the empirical utilization of both methods, a comprehensive investigation into their efficacy is lacking. Our study, conducted using two public datasets, VoxCeleb and CN-Celeb, revealed that both SP and VTLP are proficient at generating new speakers, leading to significant performance improvements in speaker recognition. Furthermore, they exhibit distinct properties in sensitivity to perturbation factors and data complexity, hinting at the potential benefits of their fusion. Our research underscores the substantial potential of speaker augmentation, highlighting the importance of in-depth exploration and analysis.
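The idea of treating perturbed speech as new speakers can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `speed_perturb` and `augment_speakers` are hypothetical, linear interpolation stands in for a proper resampler, and the factors 0.9 and 1.1 are assumed as commonly used defaults rather than values taken from the paper.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, alpha: float) -> np.ndarray:
    """Resample `wave` so it plays `alpha` times faster (hypothetical helper).

    alpha > 1 shortens the signal and raises its pitch; alpha < 1 does the
    opposite. Linear interpolation is an assumption made for brevity.
    """
    n_out = int(round(len(wave) / alpha))
    # Positions in the original signal that the perturbed signal samples.
    src_pos = np.arange(n_out) * alpha
    return np.interp(src_pos, np.arange(len(wave)), wave)


def augment_speakers(utterances, alphas=(0.9, 1.1)):
    """Return the originals plus perturbed copies labelled as pseudo-speakers.

    `utterances` is a list of (speaker_id, waveform) pairs. Each perturbed
    copy receives a new speaker label, which is what distinguishes speaker
    augmentation from speaker-preserving augmentation.
    """
    augmented = list(utterances)
    for spk, wave in utterances:
        for alpha in alphas:
            augmented.append((f"{spk}-sp{alpha}", speed_perturb(wave, alpha)))
    return augmented


if __name__ == "__main__":
    sr = 16000
    tone = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)  # 1 s synthetic signal
    data = augment_speakers([("spk1", tone)])
    for label, wave in data:
        print(label, len(wave))
```

With the assumed factors, each original speaker yields two extra pseudo-speakers, so the number of training classes triples. A VTLP variant would warp the frequency axis instead of resampling in time, leaving the utterance duration unchanged.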

Paper Structure (18 sections, 3 equations, 2 figures, 3 tables)

Introduction
Related Work
Review for SP and VTLP
Speed Perturbation (SP)
Vocal Tract Length Perturbation (VTLP)
SP vs. VTLP
Experiments
Data
Settings
Deviation Analysis
Principle of speaker augmentation
No-distortion range
Deviation distribution curve
Deviation-Perturbation curve
Speaker Recognition Results
...and 3 more sections

Figures (2)

Figure 1: The deviation distribution curves. Curves with $\alpha>1$ are plotted in dotted lines, while curves with $\alpha<1$ are plotted in solid lines.
Figure 2: The deviation-perturbation curve with (a) SP and (b) VTLP.
