A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition
Zhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang
TL;DR
The study addresses how speaker augmentation via speed perturbation (SP) and vocal tract length perturbation (VTLP) can expand speaker recognition training data by generating new pseudo speakers, thereby enriching the embedding space. It conducts a controlled comparative analysis on the public VoxCeleb1 and CN-Celeb1 datasets using a ResNet34-SE/x-vector pipeline, evaluating the perturbation factor $\alpha$ and deviation measures to diagnose when augmented data remains intelligible while introducing meaningful speaker variation. It finds that both SP and VTLP improve recognition, with SP typically offering stronger gains; their effects differ with data complexity and the chosen $\alpha$, and combining them can yield additional improvements on VoxCeleb with careful fusion design. The work emphasizes the potential of speaker augmentation to expand speaker diversity and suggests future directions, including optimal fusion strategies and other speech morphing techniques.
Abstract
Data augmentation (DA) has played a pivotal role in the success of deep speaker recognition. Current DA techniques primarily focus on speaker-preserving augmentation, which does not change the speaker trait of the speech and does not create new speakers. Recent research has shed light on the potential of speaker augmentation, which generates new speakers to enrich the training dataset. In this study, we delve into two speaker augmentation approaches: speed perturbation (SP) and vocal tract length perturbation (VTLP). Despite the empirical utilization of both methods, a comprehensive investigation into their efficacy is lacking. Our study, conducted using two public datasets, VoxCeleb and CN-Celeb, revealed that both SP and VTLP are proficient at generating new speakers, leading to significant performance improvements in speaker recognition. Furthermore, they exhibit distinct properties in sensitivity to perturbation factors and data complexity, hinting at the potential benefits of their fusion. Our research underscores the substantial potential of speaker augmentation, highlighting the importance of in-depth exploration and analysis.
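To make the idea of speaker augmentation concrete, the following is a minimal sketch of speed perturbation used to create a pseudo speaker. It is not the authors' implementation: the function names `speed_perturb` and `augment_speaker`, the label suffix format, and the use of linear interpolation for resampling are all assumptions made for illustration. Practical pipelines typically rely on a proper band-limited resampler.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, alpha: float) -> np.ndarray:
    """Resample `wave` so that it plays `alpha` times faster.

    Played back at the original sample rate, the output is shorter and
    higher-pitched for alpha > 1, and longer and lower-pitched for alpha < 1.
    Linear interpolation is an illustrative assumption, not a production resampler.
    """
    n_out = int(round(len(wave) / alpha))
    # Positions in the original signal that each output sample reads from.
    src_pos = np.arange(n_out) * alpha
    return np.interp(src_pos, np.arange(len(wave)), wave)

def augment_speaker(utts, alpha):
    """Turn each (speaker_id, waveform) pair into a pseudo-speaker pair.

    The perturbed utterance receives a NEW speaker label (hypothetical
    naming scheme), so the classifier treats it as a distinct speaker.
    """
    return [(f"{spk}-sp{alpha}", speed_perturb(w, alpha)) for spk, w in utts]

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 220.0 * t)  # one second of a 220 Hz tone
    for label, wav in augment_speaker([("id001", tone)], 1.1):
        print(label, len(wav))
```

The key design point is the relabelling step: a speaker-preserving augmentation would keep the label `id001`, whereas speaker augmentation assigns the perturbed audio to a new class.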
