Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
TL;DR
The paper examines how Whisper achieves noise-robust ASR and finds that its representations encode rich non-speech background information rather than being noise-invariant. It then introduces Whisper-AT, a unified ASR and audio tagging model that freezes the Whisper backbone and adds a lightweight tagging head, achieving competitive audio tagging performance at less than 1% extra computational cost. The key finding is a positive link between ASR robustness and background-sound recognition, which makes joint transcription and event tagging possible in a single forward pass. Empirically, Whisper-AT attains strong AudioSet and ESC-50 results (e.g., 41.5 mAP on AudioSet) while running substantially faster than standalone audio tagging models, a practical benefit for integrated audio understanding in real-world systems.
Abstract
In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model, Whisper-AT, by freezing the backbone of Whisper and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
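The frozen-backbone idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tagging head's architecture, so everything below is an assumption made for illustration: `frozen_encoder` is a hypothetical stand-in that returns random per-layer features (it is not the Whisper API), and `TaggingHead` is one plausible lightweight design that mean-pools each layer over time, mixes layers with learned softmax weights, and applies a linear layer with per-class sigmoids, since audio tagging is multi-label. The dimensions are toy values.

```python
import math
import random


def mean_pool(frames):
    """Average a (time x dim) list of frame vectors over the time axis."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frozen_encoder(audio_frames, num_layers=4, dim=8, seed=1):
    """Hypothetical stand-in for a frozen ASR encoder (NOT the Whisper API).

    Returns one (time x dim) feature matrix per layer. In a real system these
    would be the intermediate representations already computed for ASR.
    """
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(dim)] for _ in audio_frames]
        for _ in range(num_layers)
    ]


class TaggingHead:
    """Assumed lightweight head; its parameters are the only trainable ones."""

    def __init__(self, num_layers, dim, num_classes, seed=0):
        rng = random.Random(seed)
        # One mixing logit per encoder layer (softmax-normalised when used).
        self.layer_logits = [0.0] * num_layers
        # Linear classifier: num_classes x dim weights plus biases.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def __call__(self, layer_reps):
        pooled = [mean_pool(rep) for rep in layer_reps]  # one vector per layer
        mix = softmax(self.layer_logits)
        dim = len(pooled[0])
        # Weighted average of the pooled layer vectors.
        fused = [sum(w * p[d] for w, p in zip(mix, pooled)) for d in range(dim)]
        # Independent sigmoid per class: several events may co-occur.
        return [
            sigmoid(sum(wi * x for wi, x in zip(row, fused)) + b)
            for row, b in zip(self.weight, self.bias)
        ]


audio_frames = [0.0] * 10                 # placeholder input, 10 time steps
layer_reps = frozen_encoder(audio_frames)  # computed once, reusable for ASR decoding
head = TaggingHead(num_layers=4, dim=8, num_classes=5)
probs = head(layer_reps)                   # one probability per event class
```

Because the head only consumes features the encoder has already produced, tagging adds little work beyond pooling and one small linear layer, which is the intuition behind the low extra cost claimed in the abstract. Training such a head would update only `layer_logits`, `weight`, and `bias` while the encoder stays fixed.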
