Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization

Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang

TL;DR

The paper tackles the lack of speaker labels in self-supervised speaker verification and the collapse risk in non-contrastive learning. It proposes two main innovations within the Self-Distillation Prototypes Network (SDPN): (i) dimension regularization, including off-diagonal and Frobenius terms, to boost embedding diversity and prevent redundancy, and (ii) score normalization, notably AS-norm, to mitigate score drift without labels. Empirically, the approach yields state-of-the-art VoxCeleb1 results, with EERs of $1.29\%$, $1.60\%$, and $2.80\%$ on VoxCeleb1-O/E/H when combined with Frobenius regularization and AS-norm, representing substantial relative improvements over prior self-supervised methods. Overall, the method narrows the gap to supervised SV while maintaining the efficiency of self-supervised training, validated on large-scale VoxCeleb data.
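To make the dimension-regularization idea concrete, here is a minimal NumPy sketch; it is not the paper's implementation. It assumes both penalties are computed on the correlation matrix of a batch of embeddings. The function names, the normalization constants, and the exact form of the Frobenius term are illustrative assumptions, since this summary does not reproduce the paper's equations.

```python
import numpy as np


def correlation_matrix(emb, eps=1e-8):
    """Correlation between embedding dimensions for a (batch, dim) array."""
    z = emb - emb.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + eps)
    return (z.T @ z) / emb.shape[0]


def off_diagonal_penalty(emb):
    """Mean squared off-diagonal correlation (assumed form of the off-diagonal term)."""
    c = correlation_matrix(emb)
    d = c.shape[0]
    off = c - np.diag(np.diag(c))  # zero out the diagonal
    return float((off ** 2).sum() / (d * (d - 1)))


def frobenius_penalty(emb):
    """Squared Frobenius norm of the correlation matrix, scaled by dim (assumed form)."""
    c = correlation_matrix(emb)
    return float(np.linalg.norm(c, ord="fro") ** 2 / c.shape[0])
```

Under these assumptions, a batch whose dimensions are near-copies of one another receives a much larger penalty than a batch with independent dimensions, which is the redundancy the regularizer is meant to discourage. In training, such a term would be added to the self-distillation loss with a weighting coefficient.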

Abstract

Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework, Self-Distillation Prototypes Network (SDPN), by introducing dimension regularization that explicitly addresses the collapse problem by applying regularization terms to speaker embeddings. Moreover, we integrate score normalization techniques from fully supervised SV to further narrow the gap to supervised verification performance. SDPN with dimension regularization and score normalization sets a new state of the art on the VoxCeleb1 speaker verification benchmark, achieving Equal Error Rates (EERs) of 1.29%, 1.60%, and 2.80% on the VoxCeleb1-{O,E,H} trials, respectively. These results represent relative improvements of 28.3%, 19.6%, and 22.6% over the current best self-supervised methods, thereby advancing the frontiers of SV technology.
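The score-normalization step can be illustrated with a short sketch of adaptive symmetric normalization (AS-norm) in its commonly used form: each side of a trial is scored against a cohort, and the raw score is standardized with the mean and standard deviation of that side's top-k cohort scores. The function names, the default `k`, and the use of cosine scoring are assumptions for illustration; the cohort source and cohort size used in the paper are not stated in this summary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def top_k_stats(emb, cohort, k):
    """Mean and std of the k highest cosine scores between emb and the cohort."""
    scores = cohort @ emb
    top = np.sort(scores)[-k:]
    return top.mean(), top.std() + 1e-8


def as_norm(enroll, test, cohort, k=300):
    """AS-norm of one trial; cohort is a (num_cohort, dim) array of embeddings."""
    enroll, test, cohort = l2_normalize(enroll), l2_normalize(test), l2_normalize(cohort)
    raw = float(enroll @ test)  # cosine score of the trial
    k = min(k, cohort.shape[0])
    mu_e, sd_e = top_k_stats(enroll, cohort, k)
    mu_t, sd_t = top_k_stats(test, cohort, k)
    # Average the two z-normalized scores so the result is symmetric in the trial.
    return float(0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t))
```

Because the statistics come only from cohort embeddings, this normalization needs no speaker labels, which is what makes it usable in a self-supervised setting.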

Paper Structure

This paper contains 16 sections, 10 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the SDPN framework with dimension regularization. It includes teacher and student networks with identical architectures but different parameters. The teacher network's outputs serve as targets for optimizing the student network. Diversity regularization reduces the correlation between feature dimensions.
  • Figure 2: t-SNE visualization of extracted embeddings for five speakers, each shown in a different color. The left panel shows embeddings from SDPN, and the right panel shows those from SDPN with dimension regularization. The embeddings with dimension regularization are better separated, indicating enhanced discriminability.