Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization
Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang
TL;DR
The paper tackles two problems in self-supervised speaker verification (SV): the lack of speaker labels and the risk of collapse in non-contrastive learning. It proposes two additions to the Self-Distillation Prototypes Network (SDPN): (i) dimension regularization, including off-diagonal and Frobenius terms, to boost embedding diversity and prevent redundancy, and (ii) score normalization, notably AS-norm, to mitigate score drift without labels. Combined with Frobenius regularization and AS-norm, SDPN reaches state-of-the-art Equal Error Rates (EERs) of $1.29\%$, $1.60\%$, and $2.80\%$ on VoxCeleb1-O/E/H, which are relative improvements of $28.3\%$, $19.6\%$, and $22.6\%$ over the best prior self-supervised methods. Overall, the method narrows the gap to supervised SV while maintaining the efficiency of self-supervised training, validated on large-scale VoxCeleb data.
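To make the score normalization step concrete, here is a minimal, dependency-free sketch of AS-norm (adaptive symmetric score normalization) in its commonly used form: the raw trial score is standardized against the top-k cohort scores of the enrollment side and of the test side, and the two results are averaged. Cosine scoring, the cohort of unlabeled embeddings, the value of `k`, and the function names `cosine`, `top_k_stats`, and `as_norm` are assumptions for illustration; the summary above does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_stats(embedding, cohort, k):
    """Mean and standard deviation of the k highest cohort scores."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:k]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    # Floor the deviation to avoid division by zero (assumed safeguard).
    return mean, max(math.sqrt(var), 1e-8)


def as_norm(enroll, test, cohort, k=300):
    """Average of the enrollment-side and test-side normalized scores."""
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, cohort, k)
    mu_t, sd_t = top_k_stats(test, cohort, k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)
```

The cohort needs no speaker labels, only embeddings, which is why this kind of normalization fits a self-supervised setting. The result is symmetric: swapping the enrollment and test embeddings gives the same score.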
Abstract
Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework, Self-Distillation Prototypes Network (SDPN), by introducing dimension regularization, which explicitly addresses the collapse problem by applying regularization terms to speaker embeddings. Moreover, we integrate score normalization techniques from fully supervised SV to further narrow the gap to supervised verification performance. SDPN with dimension regularization and score normalization sets a new state of the art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rates (EERs) of 1.29%, 1.60%, and 2.80% on the VoxCeleb1-{O,E,H} trials, respectively. These results represent relative improvements of 28.3%, 19.6%, and 22.6% over the current best self-supervised methods, thereby advancing the frontiers of SV technology.
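As a rough illustration of dimension regularization on speaker embeddings, the sketch below computes an off-diagonal penalty and a Frobenius penalty on the covariance matrix of a batch of embeddings. This is one plausible formulation, not necessarily the paper's exact loss: the choice of covariance matrix, the absence of any weighting factor, and the names `covariance_matrix`, `off_diagonal_penalty`, and `frobenius_penalty` are all assumptions.

```python
def covariance_matrix(embeddings):
    """Covariance between embedding dimensions over a batch.

    embeddings: list of N vectors, each of length D.
    Returns a D x D matrix as a list of lists.
    """
    n = len(embeddings)
    d = len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(d)]
    centered = [[e[j] - means[j] for j in range(d)] for e in embeddings]
    return [
        [sum(c[a] * c[b] for c in centered) / n for b in range(d)]
        for a in range(d)
    ]


def off_diagonal_penalty(cov):
    """Sum of squared off-diagonal entries: penalizes correlated dimensions."""
    d = len(cov)
    return sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b)


def frobenius_penalty(cov):
    """Squared Frobenius norm: sum of squares of all entries."""
    return sum(v ** 2 for row in cov for v in row)
```

For the batch `[[1, 1], [-1, -1]]`, where both dimensions carry the same signal, the off-diagonal penalty is 2.0 and the Frobenius penalty is 4.0. For the four sign combinations `[[1, 1], [1, -1], [-1, 1], [-1, -1]]`, where the dimensions are uncorrelated, they drop to 0.0 and 2.0. Under this formulation the two penalties differ only by the sum of squared per-dimension variances.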
