On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification
Calum Heggan, Sam Budgett, Timothy Hospedales, Mehrdad Yaghoobi
TL;DR
The paper asks how well large-scale self-supervised audio models pretrained on LibriSpeech 960 transfer to $N$-Way $K$-Shot audio classification, and how few-shot performance relates to results on the SUPERB benchmark. It evaluates 13 pretrained models as frozen feature extractors with linear probes across 10 diverse datasets. Key findings include a new state-of-the-art on SpeechCommandsv2 and nuanced cross-task transfer: few-shot performance in some domains aligns with SUPERB results, while in others it does not. The authors recommend integrating few-shot tasks into benchmark suites for more robust evaluation in low-resource settings. These results inform benchmark design and practical deployment, highlighting both the transfer potential and the limitations of current self-supervised representations in few-shot audio tasks.
Abstract
In recent years, self-supervised learning has excelled at learning robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including few-shot learning. While the evaluation of unsupervised approaches for few-shot learning is well established in imagery, it is notably absent in acoustics. This study addresses this gap by assessing the performance of large-scale self-supervised models in few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and its performance on other downstream task benchmarks. Our findings reveal state-of-the-art performance on some few-shot problems, such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.
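To make the evaluation setup concrete, the following is a minimal sketch of a single $N$-Way $K$-Shot episode scored with a linear probe on fixed features. It is not the authors' implementation. The feature vectors are synthetic stand-ins for the outputs of a frozen pretrained encoder, every function name here is hypothetical, and training the probe as softmax regression with plain gradient descent is an assumption, since this summary does not say how the probes are fitted.

```python
import math
import random


def toy_features(n_classes=8, per_class=10, dim=8, seed=0):
    """Synthetic stand-in for frozen-encoder embeddings: one tight cluster per class."""
    rng = random.Random(seed)
    feats, labels = [], []
    for c in range(n_classes):
        for _ in range(per_class):
            feats.append([(5.0 if j == c else 0.0) + rng.gauss(0.0, 0.1) for j in range(dim)])
            labels.append(c)
    return feats, labels


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Pick n_way classes, then k_shot support and n_query query indices per class."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):  # relabel classes 0..n_way-1 within the episode
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, new_label) for i in picked[:k_shot]]
        query += [(i, new_label) for i in picked[k_shot:]]
    return support, query


def fit_linear_probe(xs, ys, n_way, lr=0.5, steps=200):
    """Softmax regression on fixed features, trained by full-batch gradient descent."""
    dim = len(xs[0])
    W = [[0.0] * dim for _ in range(n_way)]
    b = [0.0] * n_way
    for _ in range(steps):
        gW = [[0.0] * dim for _ in range(n_way)]
        gb = [0.0] * n_way
        for x, y in zip(xs, ys):
            logits = [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(n_way)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            for c in range(n_way):
                err = exps[c] / z - (1.0 if c == y else 0.0)
                gb[c] += err
                for j in range(dim):
                    gW[c][j] += err * x[j]
        for c in range(n_way):
            b[c] -= lr * gb[c] / len(xs)
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / len(xs)
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    return scores.index(max(scores))


def episode_accuracy(features, labels, n_way=5, k_shot=1, n_query=5, seed=0):
    """Fit the probe on the support set only and score it on the query set."""
    rng = random.Random(seed)
    support, query = sample_episode(labels, n_way, k_shot, n_query, rng)
    W, b = fit_linear_probe([features[i] for i, _ in support],
                            [y for _, y in support], n_way)
    return sum(predict(W, b, features[i]) == y for i, y in query) / len(query)


features, labels = toy_features()
print(episode_accuracy(features, labels, n_way=5, k_shot=1))
```

In practice one would average this accuracy over many episodes and replace `toy_features` with embeddings extracted once from the frozen model. Only the probe's weights are updated, which is what keeps per-episode cost low.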
