Careless Whisper: Speech-to-Text Hallucination Harms

Allison Koenecke, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, Mona Sloane

TL;DR

The paper systematically quantifies hallucinations in Whisper on AphasiaBank English data, defining hallucinations as content not faithful to the audio and distinguishing their harms from mere transcription errors. It shows that about 1% of transcriptions contain hallucinations, with roughly 38% of these being harmful, and finds higher rates for speakers with aphasia, linked to longer non-vocal segments. The work analyzes underlying causes (end-to-end generative modeling and speech-disfluency patterns) and documents significant ethical and legal implications, including bias amplification and safety risks for vulnerable populations. It concludes with concrete calls to action for disclosure, inclusive design, default calibration, and further research to mitigate harms in downstream applications.

Abstract

Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations, a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.

Paper Structure (19 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Hallucinations are more common for speakers with aphasia than without, and can cause harm by nature of perpetuating violence, inaccurate associations, and false authority.
  • Figure 2: Two examples of control speakers whose Whisper transcriptions from December 2023 included non-English text, despite the API language setting being English. The first example is not a hallucination, whereas the second is (it involves a repeating loop and displays a harm of false authority: thanking).
  • Figure 3: Speakers with aphasia had audio files with significantly longer shares of non-vocal sounds (i.e., PyAnnote non-vocal duration in seconds, divided by total duration in seconds) relative to their control speaker counterparts. Furthermore, non-vocal shares of audio files were significantly higher for files with Whisper hallucinations as opposed to files that did not yield hallucinations. Mean non-vocal shares for aphasia speakers with hallucinations, aphasia speakers without hallucinations, control speakers with hallucinations, and control speakers without hallucinations are: 42.4%, 40.6%, 16.2%, and 15.4%, respectively.
  • Figure 4: Mahalanobis matching on participant demographics. On the matched subset, audio segments spoken by aphasia speakers continue to show higher rates of hallucinations relative to segments spoken by control group speakers.
  • Figure 5: Our findings on non-vocal durations are consistent when using a different package to perform Voice Activity Detection (VAD). When using Silero via PyTorch Silero_VAD (instead of PyAnnote [Bredin2021]), we continue to find that aphasia speakers and audio yielding hallucinations have longer non-vocal durations relative to control speakers and audio not yielding hallucinations.
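
The non-vocal share used in Figures 3 and 5 is described as non-vocal duration in seconds divided by total duration in seconds. The following is a minimal sketch of that ratio, not the authors' code. It assumes a VAD tool (such as PyAnnote or Silero) has already produced a list of speech intervals as (start, end) pairs in seconds. The names `merge_segments` and `non_vocal_share` are hypothetical helpers introduced for illustration, and the example timestamps are made up.

```python
from typing import Iterable, List, Tuple

# A speech interval as (start_seconds, end_seconds); assumed to come from a VAD tool.
Segment = Tuple[float, float]


def merge_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Merge overlapping or touching intervals so speech time is not double-counted."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if end <= start:
            continue  # skip empty or inverted intervals
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def non_vocal_share(speech_segments: Iterable[Segment], total_duration: float) -> float:
    """Return (total duration - speech duration) / total duration."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    speech = 0.0
    for start, end in merge_segments(speech_segments):
        # Clip each interval to the bounds of the audio file.
        start = max(0.0, start)
        end = min(total_duration, end)
        if end > start:
            speech += end - start
    return (total_duration - speech) / total_duration


if __name__ == "__main__":
    # Illustrative (made-up) VAD output for a 10-second file.
    example = [(0.5, 2.0), (1.5, 3.0), (6.0, 8.0)]
    print(f"non-vocal share: {non_vocal_share(example, 10.0):.2f}")
```

Merging intervals first is a design choice for this sketch: some VAD outputs contain overlapping regions, and summing them directly would understate the non-vocal share.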