Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli
TL;DR
This work examines how the common practice of benchmarking speech SSL models with fixed, low-capacity probing heads biases model rankings. Evaluating state-of-the-art SSL encoders across a wide set of tasks with two sets of probing heads of differing capacity, it shows that larger-capacity decoders substantially alter performance rankings, make better use of multi-level features, and improve out-of-domain generalization, at a manageable inference cost. It also shows that headless evaluation methods such as ABX/AX do not reliably predict downstream performance, which highlights the limitations of intrinsic benchmarks. The authors release MP3S within SpeechBrain to enable reproducible, multi-probe benchmarking, and they advocate for benchmarks that include higher-capacity probing heads to better reflect real-world deployment needs.
Abstract
Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches has fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. We find that altering the downstream architecture leads to significant fluctuations in the performance ranking of the evaluated models. Contrary to common practice in speech SSL benchmarking, we evaluate larger-capacity probing heads and show their impact on performance, inference costs, generalization, and multi-level feature exploitation.
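To make the notion of ranking fluctuation concrete, the following minimal sketch compares the ordering of the same encoders under two probing heads and summarizes the agreement with a Kendall rank correlation. It is an illustration only, not the authors' evaluation code: the encoder names, the scores, and the helper functions `rank_models` and `kendall_tau` are all hypothetical, and the numbers are placeholders rather than results from the paper.

```python
from itertools import combinations


def rank_models(scores, higher_is_better=True):
    """Return model names ordered from best to worst for one probing head."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two score dictionaries.

    Assumes both dictionaries share the same keys and contain no tied scores.
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both heads order the pair (m, n) the same way.
        agreement = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores (NOT results from the paper):
# one accuracy-like score per encoder, for a small and a large probing head.
small_head = {"encoder_a": 0.70, "encoder_b": 0.65, "encoder_c": 0.60}
large_head = {"encoder_a": 0.78, "encoder_b": 0.80, "encoder_c": 0.72}

print("Ranking with small head:", rank_models(small_head))
print("Ranking with large head:", rank_models(large_head))
print("Kendall tau between heads:", kendall_tau(small_head, large_head))
```

A correlation of 1 would mean that both heads rank the encoders identically, while lower values indicate that the choice of probing head changes which encoder appears best. The same comparison could be applied, under the same assumptions, between a headless metric and a downstream score.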
