Investigating the Emergent Audio Classification Ability of ASR Foundation Models

Rao Ma, Adian Liusie, Mark J. F. Gales, Kate M. Knill

TL;DR

This work investigates the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification and shows that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.

Abstract

Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of eight audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%. An important step in unlocking this emergent ability is debiasing: a simple unsupervised reweighting of the class probabilities yields consistent and significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.
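
The prompting step described in the abstract can be illustrated with a minimal sketch. It assumes that some ASR decoder has already returned a log-likelihood for each class prompt given the audio; the template wording, the class names, the scores, and the function names (`build_prompts`, `class_probabilities`, `predict`) are hypothetical and are not taken from the paper.

```python
import math


def build_prompts(template, class_labels):
    """Fill a text template with each class label (template wording is hypothetical)."""
    return {label: template.format(label) for label in class_labels}


def class_probabilities(log_likelihoods):
    """Convert per-class log-likelihoods to probabilities with a stable softmax."""
    peak = max(log_likelihoods.values())
    unnormalised = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


def predict(log_likelihoods):
    """Return the most probable class together with the full distribution."""
    probs = class_probabilities(log_likelihoods)
    return max(probs, key=probs.get), probs


# Hypothetical template and labels, for illustration only.
prompts = build_prompts("This is a sound of {}.", ["dog", "rain", "siren"])

# Placeholder scores. In a real system each value would be the decoder
# log-likelihood of prompts[label] given the input audio.
example_scores = {"dog": -12.3, "rain": -10.1, "siren": -15.8}

predicted_label, probs = predict(example_scores)
```

No model is loaded here: scoring the prompts with Whisper or MMS is left outside the sketch, so only the conversion from scores to a decision is shown.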

Paper Structure (29 sections, 8 equations, 8 figures, 13 tables)


Figures (8)

  • Figure 1: This paper looks at zero-shot prompting of ASR foundation models for audio classification, without any further training or introducing any new parameters. We use task-specific prompts and evaluate on various downstream tasks and datasets.
  • Figure 2: ASR foundation models are leveraged for zero-shot audio classification by prompting the decoder to calculate the log-likelihood of label sequences associated with each class. The log-likelihood for each class is converted to probabilities and post-processed to a predicted class. This process is displayed for Whisper.
  • Figure 3: Predicted class distribution for Whisper large-v2 on RAVDESS. Bar width is proportional to the fraction of decisions per class.
  • Figure 4: Parameter size vs average accuracy (with prior-matching) for different versions of Whisper models.
  • Figure 5: Zero-shot audio question answering method.
  • ...and 3 more figures
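
The captions for Figures 3 and 4 refer to the predicted class distribution and to prior-matching. One way to read this is as an unsupervised reweighting that makes the average predicted distribution over unlabeled data match a target prior. The sketch below works under that reading and makes two assumptions that this page does not confirm: the target prior is uniform, and the weights are found with a simple multiplicative fixed-point update. The function names and the toy data are hypothetical.

```python
def reweight(probs, weights):
    """Scale each class probability by its weight, then renormalise."""
    scaled = [p * w for p, w in zip(probs, weights)]
    total = sum(scaled)
    return [s / total for s in scaled]


def mean_distribution(rows):
    """Average class distribution over a list of per-sample distributions."""
    n_classes = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(n_classes)]


def fit_prior_matching_weights(dataset_probs, n_iters=500):
    """Find class weights so the mean reweighted distribution is near uniform.

    Assumption: uniform target prior and a multiplicative update; the
    paper's exact procedure may differ.
    """
    n_classes = len(dataset_probs[0])
    target = 1.0 / n_classes
    weights = [1.0] * n_classes
    for _ in range(n_iters):
        adjusted = [reweight(p, weights) for p in dataset_probs]
        mean = mean_distribution(adjusted)
        # Raise the weight of under-predicted classes, lower over-predicted ones.
        weights = [w * target / max(m, 1e-12) for w, m in zip(weights, mean)]
        # Keep the weights on a fixed scale; only their ratios matter.
        scale = n_classes / sum(weights)
        weights = [w * scale for w in weights]
    return weights


# Toy unlabeled "dataset" of class probabilities biased towards class 0.
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.1, 0.4],
]
weights = fit_prior_matching_weights(toy_probs)
debiased = [reweight(p, weights) for p in toy_probs]
debiased_mean = mean_distribution(debiased)
```

No labels are used at any point, which matches the unsupervised nature of the debiasing described in the abstract. The final prediction for a sample would be the argmax of its reweighted distribution.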