Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels
Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald
TL;DR
This work tackles unsupervised speaker representation learning with iterative pseudo-labeling (IPL) and shows that a simple i-vector generative model is enough to bootstrap the process. It systematically analyzes how the initial model (i-vector vs. DINO), the encoder, augmentations, the number of clusters, and the clustering algorithm affect IPL performance, demonstrating that an i-vector starting point can rival state-of-the-art methods after several IPL iterations. The key contributions are (1) demonstrating effective IPL bootstrapping with i-vectors, and (2) providing a detailed ablation study that clarifies which components most influence convergence and accuracy. The findings offer practical guidance for designing unsupervised speaker systems: well-chosen clustering and encoders, together with augmentations, can compensate for weaker initial representations and reduce reliance on heavy self-supervised pretraining.
Abstract
Iterative self-training, or iterative pseudo-labeling (IPL), in which an improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for the unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
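The IPL loop described in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the paper's implementation: the "utterance features" are synthetic Gaussian blobs, the initial embeddings are a noisy random projection standing in for i-vectors, the clustering step is plain k-means, and the encoder is a small linear model with a softmax head rather than a neural speaker encoder trained with augmentations. All function and variable names (`kmeans`, `train_encoder`, `iterative_pseudo_labeling`, `init_emb`) are hypothetical.

```python
import numpy as np


def kmeans(x, k, rng, n_iter=20):
    """Plain k-means; returns one integer cluster id (pseudo-label) per row."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def train_encoder(x, labels, k, emb_dim, rng, steps=100, lr=0.05):
    """Fit a linear encoder plus softmax head on pseudo-labels (toy encoder)."""
    n, d = x.shape
    w = rng.normal(scale=0.1, size=(d, emb_dim))  # encoder weights
    v = rng.normal(scale=0.1, size=(emb_dim, k))  # classification head
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        h = x @ w
        logits = h @ v
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        grad_v = h.T @ g
        grad_w = x.T @ (g @ v.T)
        v -= lr * grad_v
        w -= lr * grad_w
    return w


def iterative_pseudo_labeling(features, init_emb, k, n_rounds=3, emb_dim=8, seed=0):
    """Cluster current embeddings, train on the pseudo-labels, re-embed, repeat."""
    rng = np.random.default_rng(seed)
    emb = init_emb
    history = []
    for _ in range(n_rounds):
        labels = kmeans(emb, k, rng)                          # pseudo-labels
        w = train_encoder(features, labels, k, emb_dim, rng)  # new model
        emb = features @ w                                    # improved embeddings
        history.append(labels)
    return emb, history


# Synthetic stand-in data: 3 "speakers", 50 "utterances" each (assumption).
rng = np.random.default_rng(0)
n_speakers, per_speaker, dim = 3, 50, 16
true_ids = np.repeat(np.arange(n_speakers), per_speaker)
centers = rng.normal(scale=3.0, size=(n_speakers, dim))
features = centers[true_ids] + rng.normal(size=(len(true_ids), dim))
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Weak initial embeddings: noisy random projection, a stand-in for i-vectors.
init_emb = features @ rng.normal(size=(dim, 4))
init_emb = init_emb + rng.normal(scale=2.0, size=init_emb.shape)

emb, history = iterative_pseudo_labeling(features, init_emb, k=n_speakers)
```

In this sketch, `k` corresponds to the number of clusters, `kmeans` to the clustering algorithm, and `init_emb` to the initial model; these are the components whose influence the abstract says is studied. The sketch does not model augmentations.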
