Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio
Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Stanisław Kacprzak, Marcin Witkowski, Konrad Kowalczyk
TL;DR
The paper investigates Whisper ASR hallucinations triggered by non-speech audio, compiling a large non-speech audio corpus to quantify and characterize these errors. It introduces a Bag of Hallucinations (BoH) by filtering frequent outputs via a log-probability threshold and occurrence counts, and couples BoH with delooping and Aho–Corasick search to post-process transcriptions. It then examines how augmenting speech with non-speech sounds affects hallucinations and evaluates mitigation strategies, including VAD-based pre-processing and forced alignment-based validation. The results show that BoH-based post-processing, especially when combined with robust VAD, can reduce WER and provide a practical safeguard against dangerous hallucinations, though no single method completely eliminates them.
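The BoH construction step can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `build_boh`, the input format (pairs of transcript text and average log-probability collected from non-speech audio), the threshold values, and the choice to keep outputs at or above the threshold are all assumptions made for this example.

```python
from collections import Counter


def build_boh(outputs, min_avg_logprob=-1.0, min_count=2):
    """Collect frequent, confidently decoded outputs into a bag of hallucinations.

    outputs: iterable of (text, avg_logprob) pairs obtained by transcribing
             non-speech audio (hypothetical input format).
    min_avg_logprob, min_count: illustrative thresholds, not values from the paper.
    """
    counts = Counter(
        text.strip().lower()
        for text, avg_logprob in outputs
        # Assumption: keep only non-empty outputs at or above the threshold.
        if text.strip() and avg_logprob >= min_avg_logprob
    )
    # Keep only outputs that occur often enough to count as recurring.
    return {text for text, n in counts.items() if n >= min_count}
```

With such a set in hand, a transcript can be scanned for its entries; for a large bag, a multi-pattern search such as Aho–Corasick keeps that scan efficient.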
Abstract
Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations through the post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
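To illustrate the kind of text post-processing described above, the following sketch collapses looped phrases and then deletes phrases found in a BoH. It is a simplified illustration under stated assumptions, not the authors' method: the function names, the word-level matching, the `max_ngram` and `min_repeats` parameters, and the choice to collapse a loop to a single copy are all hypothetical. A plain longest-first scan is used here for brevity in place of a dedicated multi-pattern search algorithm.

```python
import string


def deloop(words, max_ngram=5, min_repeats=3):
    """Collapse an n-gram repeated back-to-back min_repeats+ times into one copy."""
    out, i = [], 0
    while i < len(words):
        collapsed = False
        for n in range(1, max_ngram + 1):
            unit = words[i:i + n]
            if len(unit) < n:  # not enough words left for this n-gram size
                break
            reps = 1
            while words[i + reps * n:i + (reps + 1) * n] == unit:
                reps += 1
            if reps >= min_repeats:
                out.extend(unit)
                i += reps * n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return out


def normalize(word):
    """Lowercase a word and strip surrounding punctuation for matching."""
    return word.lower().strip(string.punctuation)


def remove_boh(words, boh):
    """Drop every word span whose normalized form equals a BoH phrase."""
    # Try longer phrases first so the longest match wins at each position.
    phrases = sorted(
        (tuple(normalize(w) for w in p.split()) for p in boh),
        key=len,
        reverse=True,
    )
    norm = [normalize(w) for w in words]
    out, i = [], 0
    while i < len(words):
        for p in phrases:
            if p and tuple(norm[i:i + len(p)]) == p:
                i += len(p)  # skip the hallucinated span
                break
        else:
            out.append(words[i])
            i += 1
    return out


def postprocess(text, boh):
    """Deloop the transcript, then remove BoH phrases (illustrative pipeline)."""
    return " ".join(remove_boh(deloop(text.split()), boh))
```

Note that this sketch removes matches anywhere in the transcript, so a BoH entry that also occurs in genuine speech would be deleted as well; that is a limitation of the simplification, and any practical use would need to weigh it.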
