On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification

Calum Heggan, Sam Budgett, Timothy Hospedales, Mehrdad Yaghoobi

TL;DR

The paper asks how well large-scale self-supervised audio models pretrained on LibriSpeech 960 transfer to $N$-Way $K$-Shot audio classification, and how the resulting few-shot performance relates to the SUPERB benchmark. It evaluates 13 pretrained models as frozen feature extractors with linear probes across 10 diverse datasets, with the aim of providing guidance for few-shot transfer. Key findings include a new state-of-the-art on SpeechCommandsV2 and nuanced cross-task transfer, with few-shot performance aligning with SUPERB in some domains and not in others. The authors recommend integrating few-shot tasks into benchmark suites for robust evaluation in low-resource settings. These results inform benchmark design and practical deployment, highlighting both the transfer potential and the limitations of current self-supervised representations in few-shot audio tasks.

Abstract

In recent years, self-supervised learning has excelled for its capacity to learn robust feature representations from unlabelled data. Networks pretrained through self-supervision serve as effective feature extractors for downstream tasks, including Few-Shot Learning. While the evaluation of unsupervised approaches for few-shot learning is well-established in imagery, it is notably absent in acoustics. This study addresses this gap by assessing large-scale self-supervised models' performance in few-shot audio classification. Additionally, we explore the relationship between a model's few-shot learning capability and other downstream task benchmarks. Our findings reveal state-of-the-art performance in some few-shot problems such as SpeechCommandsv2, as well as strong correlations between speech-based few-shot problems and various downstream audio tasks.
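
The evaluation setup described in the TL;DR (frozen feature extractor, linear probe, $N$-Way $K$-Shot episodes) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the softmax-regression probe, the learning rate, and the toy Gaussian "embeddings" standing in for the output of a frozen pretrained model are all assumptions made for illustration.

```python
import math
import random


def sample_episode(features_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from precomputed (frozen) embeddings."""
    classes = rng.sample(sorted(features_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(features_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


def scores(W, b, x):
    """Linear scores (logits) for one embedding x."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(support, n_way, dim, lr=0.5, steps=100):
    """Fit softmax regression on the support set only; the encoder is never updated."""
    W = [[0.0] * dim for _ in range(n_way)]
    b = [0.0] * n_way
    for _ in range(steps):
        for x, y in support:
            logits = scores(W, b, x)
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            for c in range(n_way):
                grad = exps[c] / z - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b


def episode_accuracy(W, b, query):
    """Fraction of query embeddings whose highest-scoring class is correct."""
    correct = 0
    for x, y in query:
        s = scores(W, b, x)
        correct += max(range(len(s)), key=s.__getitem__) == y
    return correct / len(query)


# Toy stand-in for frozen embeddings (hypothetical data): 8 classes,
# each a tight Gaussian blob around a scaled one-hot centre in 8 dimensions.
rng = random.Random(0)
dim = 8
features_by_class = {
    c: [[(3.0 if j == c else 0.0) + rng.gauss(0, 0.1) for j in range(dim)]
        for _ in range(20)]
    for c in range(8)
}

# One 5-way 1-shot episode with 5 query examples per class.
support, query = sample_episode(features_by_class, n_way=5, k_shot=1, n_query=5, rng=rng)
W, b = train_linear_probe(support, n_way=5, dim=dim)
acc = episode_accuracy(W, b, query)
print(f"episode accuracy: {acc:.2f}")
```

In practice the reported figure would be an average over many such episodes, and the embeddings would come from a real pretrained model rather than synthetic blobs.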

Paper Structure (15 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: SUPERB model score vs average few-shot transfer performance for all considered datasets. (TOP) row contains speech datasets, (BOTTOM) row contains environmental/animal sets. Regression gradients and shaded regions describe correlation strength and 95% confidence intervals respectively. Spearman Rank correlation coefficients (c) are shown top left of each plot.
  • Figure 2: Spearman rank correlations between Few-Shot (rows) and SUPERB (cols) tasks. Few-shot tasks are split into speech (top), environment (mid) and animal (bottom) sounds. SUPERB is split into context, speaker, semantics and paralinguistics (left to right).
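
Both figures report Spearman rank correlations. As a rough sketch of what that statistic computes, the snippet below ranks two score lists (averaging ranks over ties) and takes the Pearson correlation of the ranks. The function names and the numbers in the usage example are made up for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores (illustrative only, not from the paper).
benchmark_score = [1, 2, 3, 4, 5]
few_shot_accuracy = [2, 1, 4, 3, 5]
rho = spearman(benchmark_score, few_shot_accuracy)
print(f"Spearman rho: {rho:.2f}")
```

Because only ranks enter the calculation, any monotonic relationship between the two scores gives a coefficient of 1, whether or not it is linear.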