Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

TL;DR

The paper interrogates how Whisper achieves noise-robust ASR and reveals its representations encode rich non-speech background information rather than being noise-invariant. It then introduces Whisper-AT, a unified ASR and audio tagging approach that freezes the Whisper backbone and adds lightweight tagging heads, achieving competitive audio tagging performance with minimal compute. Key findings show a positive link between ASR robustness and background-sound recognition, enabling efficient joint transcription and event tagging. Empirically, Whisper-AT attains strong AudioSet and ESC-50 results (e.g., 41.5 mAP on AudioSet) while delivering substantial speedups over standalone audio tagging models, illustrating practical benefits for integrated audio understanding in real-world systems.

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
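
The frozen-backbone idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's model: `frozen_encoder` is a hypothetical stand-in (a fixed random projection) for the pretrained Whisper encoder, and the head is a plain mean-pool plus sigmoid layer, which is simpler than the tagging model the paper proposes. All names, shapes, and hyperparameters are made up for illustration. The only point is that training updates the head parameters and never the backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen ASR encoder: a fixed random projection.
# A real system would call the pretrained Whisper encoder here instead.
FROZEN_W = rng.normal(scale=0.1, size=(40, 16))

def frozen_encoder(frames):
    """Map (batch, time, 40) input frames to (batch, time, 16) features."""
    return np.tanh(frames @ FROZEN_W)

def head_probs(features, W, b):
    """Mean-pool over time, then apply a sigmoid layer (one output per tag)."""
    pooled = features.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-(pooled @ W + b)))

def bce(probs, labels, eps=1e-9):
    """Binary cross-entropy, summed over tags and averaged over clips."""
    per_tag = labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps)
    return -per_tag.sum(axis=1).mean()

def train_head(features, labels, lr=0.1, steps=300):
    """Gradient descent on the head only; FROZEN_W is never touched."""
    pooled = features.mean(axis=1)
    W = np.zeros((pooled.shape[1], labels.shape[1]))
    b = np.zeros(labels.shape[1])
    for _ in range(steps):
        grad = head_probs(features, W, b) - labels  # d(BCE)/d(logits)
        W -= lr * pooled.T @ grad / len(pooled)
        b -= lr * grad.mean(axis=0)
    return W, b
```

Because the encoder output is computed once and reused, the same features could in principle feed both the ASR decoder and the tagging head, which is what makes a single forward pass sufficient for both tasks.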

Paper Structure (8 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Surprisingly, the noise robustness of an ASR model correlates positively with the amount of general background sound (noise for ASR) information encoded in its intermediate representations. In the upper figure, we show Whisper is noticeably more robust (smaller word error rate increase) when speech (Librispeech) is contaminated with an increasing amount of background sounds from ESC-50 (Piczak, 2015). In the lower figure, we show the intermediate representations of Whisper lead to the best linear probing sound classification accuracy on the same ESC-50 data, indicating Whisper encodes the most background sound information. Unlike other models, Whisper encodes background sound information even in its deepest layer. PR=self-supervised pretrained; FT=PR and fine-tuned model.
  • Figure 2: Class-wise analysis of the relationship between Whisper's robustness against a specific background sound class and its potential ability to recognize the sound. We measure Whisper's robustness by its WER increase from clean speech (20dB SNR) to speech contaminated by the specific background sound from ESC-50 (-10dB SNR). The lower the WER increase, the more robust the model (Y-axis). We estimate the potential ability of Whisper to recognize the sound by training a linear layer on top of the Whisper encoder's last-layer representation for the sound classification task on the same ESC-50 dataset (without speech mixed in; the Whisper model is frozen) and show the class-wise F1-score. The higher the F1-score, the better Whisper can potentially recognize the sound class (X-axis). Blue dashed line: we observe a positive correlation between Whisper's robustness against a background sound type and its potential ability to recognize it. Blue shading: we observe most sound classes lie in the bottom-right triangle area, indicating that Whisper is not robust to a type of sound if it cannot recognize that sound type. Bottom-right outliers: there are some background sounds that Whisper can potentially recognize but is not robust to, which is expected, as some noises overlap heavily with the speech and no model can be robust to them. In short, we find the potential ability to recognize a sound type is a necessary but not sufficient condition for Whisper to be robust to it.
  • Figure 3: Histogram of the best Whisper representation layer (1-32) for the 50 ESC-50 sound classes. We train a linear layer on top of the representation of each of the 32 Whisper layers for ESC-50 sound classification, compute the class-wise F1-score, and find the best representation layer for each sound class. Different sound classes get the best F1-score on representations of different layers.
  • Figure 4: The proposed time and layer-wise Transformer model.
  • Figure 5: AS-2M audio tagging performance (left) and ASR robustness (right) of the Whisper model family.
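
The captions for Figures 1 and 2 refer to speech contaminated with background sound at a given SNR (for example 20dB and -10dB). The sketch below shows one common way to build such a mixture: scale the noise so that the speech-to-noise power ratio matches the target, then add it to the speech. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and looping the noise to fit the speech length is an assumption; the exact mixing procedure used in the paper is not given in this section.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)); solve for gain.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

At -10dB SNR the scaled noise carries ten times the power of the speech, which is why that setting serves as the heavily contaminated condition in Figure 2.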