WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper

Emmanuel Akinrintoyo, Nadine Abdelhalim, Nicole Salomons

TL;DR

This work targets the transcription of dementia speech, where disfluencies and fillers pose problems that standard ASR systems such as Whisper struggle to handle. The authors fine-tune Whisper on DementiaBank Pitt/Kempler data and an in-house CONNECT corpus, extending the tokenizer to include filler tokens and segmenting speech into short chunks to better capture fragmented utterances. The resulting WhisperD models achieve state-of-the-art performance, with a test-set WER as low as $0.24$ (WhisperD-M) and strong filler-detection metrics (F1 up to ~0.76 for WhisperD-S), and they generalize robustly to unseen speakers. These results suggest a cost-effective path for dementia screening and personalized assistive technologies, with potential extensions to multilingual settings and larger datasets.

Abstract

Whisper fails to correctly transcribe dementia speech because persons with dementia (PwDs) often exhibit irregular speech patterns and disfluencies such as pauses, repetitions, and fragmented sentences. It was trained on standard speech and may have had little or no exposure to dementia-affected speech. However, correct transcription of dementia speech is vital for cost-effective diagnosis and the development of assistive technology. In this work, we fine-tune Whisper on the open-source dementia speech dataset (DementiaBank) and our in-house dataset to improve its word error rate (WER). The fine-tuning also includes filler words, which lets us measure the filler inclusion rate (FIR) and F1 score. The fine-tuned models significantly outperformed the off-the-shelf models. The medium-sized model achieved a WER of 0.24, outperforming previous work. The models also generalised notably well to unseen data and speech patterns.
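
For readers unfamiliar with the headline metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalisation, and the example sentences are assumptions made for illustration; they are not taken from the paper, whose exact text normalisation is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by lowercasing and splitting on
    whitespace; the paper's own normalisation may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the hypothesis drops the filler "um".
reference = "the boy is um reaching for the cookie jar"
hypothesis = "the boy is reaching for the cookie jar"
wer = word_error_rate(reference, hypothesis)  # one deletion over nine words
```

Under this definition, a WER of 0.24 corresponds to roughly 24 word-level errors per 100 reference words. The example also shows why fillers matter for this metric: when the reference transcript includes a filler, a model that omits it is charged a deletion.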

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Speed Comparison of Whisper Models (Tiny (T), Base (B), Small (S) and Medium (M)): Dementia vs Control Speech