Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

TL;DR

This work tackles unsupervised speaker representation learning with iterative pseudo-labeling (IPL) and shows that a simple i-vector generative model can bootstrap the process. It systematically analyzes how the initial model (i-vector vs. DINO), the encoder, augmentations, the number of clusters, and the clustering algorithm affect IPL performance, and demonstrates that an i-vector-initialized system can rival state-of-the-art methods after several IPL iterations. The key contributions are (1) demonstrating effective IPL bootstrapping with i-vectors, and (2) providing a detailed ablation study that clarifies which components most influence convergence and accuracy. The findings offer practical guidance for designing unsupervised speaker systems: well-chosen clustering and encoders, together with augmentations, can compensate for weaker initial representations and reduce reliance on heavy self-supervised pretraining.

Abstract

Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for the unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The unsupervised speaker iterative pseudo-labeling (speaker-IPL) framework. For a given iteration $q$, we train a speaker encoder $g^q(\cdot)$ and a projector $f^q(\cdot)$ to predict pseudo-labels generated by clustering representations from the encoder $g^{q-1}(\cdot)$. $x'$ is the augmented speech segment and $x$ is the unaltered speech sample. $g^0(\cdot)$ is an unsupervised i-vector model. Blocks in black indicate trainable components, while other blocks indicate non-trainable components.
  • Figure 2: Changing components of the speaker-IPL impacts both performance and convergence trends. Performance is reported on Vox1-O.
  • Figure 3: Changing components of the speaker-IPL impacts both performance and convergence trends. Performance is reported on VoxSRC-20 (test).
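
The loop described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the data are synthetic Gaussian blobs rather than speech, `init_feats` is a noisy stand-in for the i-vector model $g^0(\cdot)$, the encoder $g^q(\cdot)$ is a single tanh layer, the projector $f^q(\cdot)$ is a linear softmax layer, the augmentation is additive Gaussian noise, and clustering is plain k-means. All function names and hyper-parameters (`kmeans`, `train_encoder`, `speaker_ipl`, `make_toy_data`, `emb_dim`, `noise`, and so on) are hypothetical.

```python
import numpy as np

def kmeans(feats, k, rng, n_iter=20):
    """Plain k-means; returns one cluster id (pseudo-label) per row."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            if np.any(labels == c):  # leave empty clusters where they are
                centers[c] = feats[labels == c].mean(0)
    return labels

def train_encoder(x, pseudo_labels, k, rng, emb_dim=8, epochs=200, lr=0.5, noise=0.1):
    """Train a toy encoder g (tanh layer) and projector f (softmax layer)
    to predict the pseudo-labels from augmented inputs x'."""
    n, d = x.shape
    W_g = rng.normal(scale=0.1, size=(d, emb_dim))   # encoder weights
    W_f = rng.normal(scale=0.1, size=(emb_dim, k))   # projector weights
    onehot = np.eye(k)[pseudo_labels]
    for _ in range(epochs):
        x_aug = x + noise * rng.normal(size=x.shape)  # assumed augmentation
        h = np.tanh(x_aug @ W_g)
        logits = h @ W_f
        logits -= logits.max(1, keepdims=True)        # numerical stability
        p = np.exp(logits)
        p /= p.sum(1, keepdims=True)
        g_logits = (p - onehot) / n                   # cross-entropy gradient
        grad_f = h.T @ g_logits
        g_h = (g_logits @ W_f.T) * (1.0 - h ** 2)     # backprop through tanh
        grad_g = x_aug.T @ g_h
        W_f -= lr * grad_f
        W_g -= lr * grad_g
    return W_g

def speaker_ipl(x, init_feats, k, n_iters=3, seed=0):
    """Iteration q clusters features from g^{q-1}, then trains g^q on the result."""
    rng = np.random.default_rng(seed)
    feats = init_feats               # stand-in for g^0 (i-vectors in the paper)
    history = []
    for _ in range(n_iters):
        labels = kmeans(feats, k, rng)           # pseudo-labels from g^{q-1}
        W_g = train_encoder(x, labels, k, rng)   # train g^q and f^q
        feats = np.tanh(x @ W_g)                 # embed the unaltered x with g^q
        history.append(labels)
    return feats, history

def make_toy_data(n_speakers=4, per_speaker=30, dim=16, seed=0):
    """Synthetic 'utterances': one Gaussian blob per hypothetical speaker."""
    rng = np.random.default_rng(seed)
    means = rng.normal(scale=3.0, size=(n_speakers, dim))
    x = np.repeat(means, per_speaker, axis=0)
    x = x + rng.normal(size=x.shape)
    init_feats = x[:, :4] + rng.normal(size=(len(x), 4))  # weak initial model
    return x, init_feats

if __name__ == "__main__":
    x, init_feats = make_toy_data()
    feats, history = speaker_ipl(x, init_feats, k=4, n_iters=3)
    print(feats.shape, [len(np.unique(h)) for h in history])
```

The projector is discarded after each round and only the encoder output is clustered for the next one, mirroring the split between $g^q(\cdot)$ and $f^q(\cdot)$ in the caption. The number of clusters `k` need not equal the true number of speakers; the paper treats it as one of the components under study.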