On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification

Calum Heggan; Sam Budgett; Timothy Hospedales; Mehrdad Yaghoobi

TL;DR

The paper asks how well large-scale self-supervised audio models pretrained on LibriSpeech 960 transfer to $N$-way $K$-shot audio classification, and how those few-shot results relate to the SUPERB benchmark. It evaluates 13 pretrained models as frozen feature extractors with linear probes across 10 diverse datasets, with the aim of giving practical guidance on few-shot transfer. Key findings include a new state of the art on SpeechCommandsV2 and uneven cross-task transfer: few-shot performance in some domains tracks SUPERB scores, while in others it does not. The authors recommend adding few-shot tasks to benchmark suites so that evaluation also covers low-resource settings. The results inform both benchmark design and practical deployment, showing the transfer potential and the limitations of current self-supervised representations in few-shot audio tasks.

Abstract

In recent years, self-supervised learning has excelled for its capacity to learn robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including Few-Shot Learning. While the evaluation of unsupervised approaches for few-shot learning is well-established in imagery, it is notably absent in acoustics. This study addresses this gap by assessing large-scale self-supervised models' performance in few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and other downstream task benchmarks. Our findings reveal state-of-the-art performance in some few-shot problems such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.
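To make the $N$-way $K$-shot setting concrete, the following is a minimal sketch of how one evaluation episode could be drawn from a labelled dataset. It is not the authors' code: the function name `sample_episode`, the `n_query` parameter, and the 5-way 1-shot values in the usage lines are assumptions chosen for illustration. In the setup the paper describes, the support items would be passed through a frozen pretrained model and used to fit a linear probe, which is then scored on the query items; that step is omitted here.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a list of class labels.

    Returns (support, query), each a list of (dataset_index, episode_label)
    pairs, where episode_label is in 0..n_way-1. All names here are
    hypothetical and not taken from the paper.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # A class is usable only if it can fill both its support and query slots.
    needed = k_shot + n_query
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= needed)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    for episode_label, cls in enumerate(rng.sample(eligible, n_way)):
        chosen = rng.sample(by_class[cls], needed)  # sampled without replacement
        support += [(i, episode_label) for i in chosen[:k_shot]]
        query += [(i, episode_label) for i in chosen[k_shot:]]
    return support, query


# Toy dataset: 8 classes with 20 examples each (placeholder data).
labels = [c for c in range(8) for _ in range(20)]
support, query = sample_episode(
    labels, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
```

Few-shot accuracy is normally reported as an average over many such episodes, since a single episode with one example per class is a very noisy estimate.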

Paper Structure (15 sections, 2 equations, 2 figures, 3 tables)


Introduction
Related Work
Few-Shot Learning
Self-Supervision
Benchmarks & Evaluations
Self-Supervision For Few-Shot Learning
Setup
Models & Pre-Training
Few-Shot Evaluation & SUPERB
Correlation
Limitations
Results
Few-Shot Performance
Relationship to SUPERB
Conclusion

Figures (2)

Figure 1: SUPERB model score vs average few-shot transfer performance for all considered datasets. (TOP) row contains speech datasets, (BOTTOM) row contains environmental/animal sets. Regression gradients and shaded regions describe correlation strength and 95% confidence intervals respectively. Spearman Rank correlation coefficients (c) are shown top left of each plot.
Figure 2: Spearman rank correlations between Few-Shot (rows) and SUPERB (cols) tasks. Few-shot tasks are split into speech (top), environment (mid) and animal (bottom) sounds. SUPERB is split into context, speaker, semantics and paralinguistics (left to right).
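Both figures report Spearman rank correlations between few-shot performance and SUPERB scores. As an illustration of that statistic, the sketch below computes it in plain Python as the Pearson correlation of average ranks. The helper names and the per-model numbers are placeholders invented for this example; they are not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for five hypothetical models (not from the paper).
superb_score = [60.0, 65.5, 70.2, 72.1, 80.3]
few_shot_acc = [41.0, 39.5, 50.2, 55.0, 61.7]
rho = spearman(superb_score, few_shot_acc)
```

Because the statistic depends only on rank order, it measures whether models that score higher on SUPERB also tend to score higher on a few-shot task, without assuming the relationship is linear.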
