Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

TL;DR

This work examines how the common practice of benchmarking speech SSL models with fixed, low-capacity probing heads biases model rankings. By evaluating state-of-the-art SSL encoders across a wide set of tasks with two sets of probing heads of differing capacity, it demonstrates that larger-capacity decoders substantially alter performance rankings, improve multi-level feature exploitation, and enhance out-of-domain generalization, while incurring manageable inference costs. It also shows that headless evaluation methods such as ABX/AX do not reliably predict downstream performance, highlighting the limitations of intrinsic benchmarks. The authors release MP3S within SpeechBrain to enable reproducible, multi-probe benchmarking and advocate for benchmarks that include higher-capacity probing heads to better reflect real-world deployment needs.

Abstract

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization and multi-level feature exploitation.

Paper Structure (18 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Performance vs. mean total inference cost (in G-MACs) depending on the probing heads used, for three models and three different downstream tasks. On all tasks, the second set of downstream probes, larger in capacity, allows smaller SSL models to bridge the gap with bigger ones in terms of accuracy, with limited additional inference costs. $DS(i)$ for $i \in \{1,2\}$ corresponds to the results obtained with the $i$-th set of downstream probes.
  • Figure 2: Values of the layer weights learned during fine-tuning for all "Base" models on the considered tasks. The values in every row sum to $1$. The weights obtained with the second set of downstream probes (bottom part of the figure) are shifted towards lower-level layers compared to those obtained with the first set (top part).
  • Figure 3: Generalization performance for automatic speaker verification. CN-Celeb Speech and CN-Celeb Song results are obtained in a zero-shot generalization setting: neither dataset is included in the training set. Random performance corresponds to an EER of 50% and is not shown for better visualization. Larger probing heads, here ECAPA-TDNN, shown in the right plot, generalize better to out-of-distribution testing samples.
  • Figure 4: Generalization performance for emotion recognition. CREMA-D and ASVP-ESD performance is tested in a zero-shot setting. The dashed blue line represents the random accuracy level. Larger probing heads, here ECAPA-TDNN, shown in the right plot, generalize better to out-of-distribution testing samples.
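
The layer weights described in the Figure 2 caption can be illustrated with a small sketch. It assumes that the probing head receives a weighted sum of the frozen encoder's per-layer features, with one learnable scalar per layer normalized by a softmax so that the weights sum to 1. The softmax normalization and all names below (`softmax`, `weighted_layer_sum`, `layer_logits`) are illustrative assumptions and are not taken from the paper or from the MP3S code.

```python
import math


def softmax(logits):
    """Map raw per-layer scores to positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_logits):
    """Combine per-layer feature vectors into a single feature vector.

    layer_features: one feature vector per encoder layer, all of the same size.
    layer_logits:   one scalar per layer (hypothetical learnable parameters).
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


# Toy example: 3 encoder layers, 2-dimensional features for one frame.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [0.0, 0.0, 0.0]  # equal scores give uniform weights

weights = softmax(logits)
combined = weighted_layer_sum(features, logits)
```

Under this assumption, each row of Figure 2 would correspond to one such `weights` vector for a given model and task, and a shift towards lower-level layers means that more of the weight mass falls on the first entries.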