
What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

Massa Baali, Sarthak Bisht, Rita Singh, Bhiksha Raj

Abstract

Speaker verification at large scale remains an open challenge, as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from the dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.

Paper Structure

This paper contains 12 sections, 4 equations, 2 figures, 2 tables, and 1 algorithm.

Figures (2)

  • Figure 1: The adaptive curriculum learning pipeline. The encoder $\mathcal{E}$ maps raw waveforms to speaker embeddings projected into a sub-center angular distance space, where each speaker region contains $K$ sub-centers capturing acoustic variability. Per-sample confidence scores from the dominant sub-center cosine similarity are ranked against running batch statistics $(\hat{\mu}, \hat{\sigma})$ into Hard, Medium, and Easy tiers. Learnable weights $W_H$, $W_M$, $W_E$ scale each sample's gradient contribution before loss reduction.
  • Figure 2: Cosine EER (%) evolution over training steps on a subset of the most difficult training instances. Curry (top) converges smoothly and stabilizes, while the Sub-center ArcFace baseline (bottom) plateaus above 4% and exhibits instability in later epochs.
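The difficulty ranking described in the abstract and in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `curry_tiers`, the exponential-moving-average form of the running statistics $(\hat{\mu}, \hat{\sigma})$, the `momentum` value, and the one-standard-deviation tier thresholds are all assumptions for the sake of the example.

```python
import numpy as np

def curry_tiers(embeddings, subcenters, labels, mu, sigma, momentum=0.9):
    """Sketch of Curry's online difficulty ranking (hyperparameters assumed).

    embeddings: (B, D) L2-normalized speaker embeddings
    subcenters: (C, K, D) L2-normalized sub-centers, K per speaker
    labels:     (B,) ground-truth speaker ids
    mu, sigma:  running statistics of the confidence scores
    """
    # cosine similarity of each sample to its own speaker's K sub-centers
    own = subcenters[labels]                       # (B, K, D)
    sims = np.einsum('bd,bkd->bk', embeddings, own)
    conf = sims.max(axis=1)                        # dominant sub-center score

    # update running statistics (EMA form is an assumption here)
    mu = momentum * mu + (1 - momentum) * conf.mean()
    sigma = momentum * sigma + (1 - momentum) * conf.std()

    # rank each sample relative to the running statistics into three tiers
    tiers = np.where(conf < mu - sigma, 'hard',
                     np.where(conf > mu + sigma, 'easy', 'medium'))
    return tiers, mu, sigma
```

In the full loss, each sample's gradient contribution would then be scaled by the learnable tier weight ($W_H$, $W_M$, or $W_E$) matching its tier before reduction, as depicted in Figure 1.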