Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection

Griffin Dietz Smith, Dianna Yee, Jennifer King Chen, Leah Findlater

TL;DR

This work tackles miscue annotation for read-aloud speech by introducing an end-to-end (E2E) model that jointly predicts a verbatim transcription and miscue events. It demonstrates that incorporating the target reading text via prompting improves transcription accuracy more than fine-tuning alone, and that augmenting the ASR model's vocabulary with miscue tokens enables end-to-end miscue detection. In two case studies (children's read-aloud speech and adult atypical speech), it shows improvements over strong baselines, with prompting offering robust gains and the E2E approach providing competitive transcription under distributional shift. Post-hoc miscue derivation from accurate transcripts remains the most precise approach to miscue detection, which suggests that E2E modeling plays a complementary role. Overall, the study advances reading-annotation methods by using the target text as context and by extending end-to-end modeling to miscue detection across diverse speech populations.

Abstract

Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies -- children's read-aloud and adult atypical speech -- and found that our proposed strategies improve verbatim transcription and miscue detection compared to current state-of-the-art.
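To make the post-hoc baseline described in the abstract concrete, the following is a minimal sketch of comparing a transcript against the target reading text. It is not the authors' implementation: the function name `posthoc_miscues` and the labels `substitution`, `omission`, and `insertion` are illustrative assumptions, and the alignment uses Python's standard `difflib` rather than any method named in the paper.

```python
from difflib import SequenceMatcher


def posthoc_miscues(target_text, transcript):
    """Align a transcript to the target text and list word-level differences.

    The label names below are hypothetical; the paper defines its own
    miscue events in its Method section.
    """
    target = target_text.lower().split()
    spoken = transcript.lower().split()
    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)

    events = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # words read as written
        if op == "replace":
            events.append(("substitution", target[i1:i2], spoken[j1:j2]))
        elif op == "delete":
            events.append(("omission", target[i1:i2], []))
        elif op == "insert":
            events.append(("insertion", [], spoken[j1:j2]))
    return events


if __name__ == "__main__":
    print(posthoc_miscues("the cat sat on the mat", "the cat sit on mat"))
```

The sketch also shows why the abstract calls this approach fragile: every difference it reports is only as reliable as the transcript, so an ASR error is indistinguishable from a reading error.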

Paper Structure

This paper contains 12 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Proposed architecture for end-to-end miscue detection by augmenting Whisper to include miscue tokens and incorporate reading text via prompting.
  • Figure 2: Example of a target reading text prompt and potential miscue events defined in the Method section.
  • Figure 3: Examples of ground truth transcripts, predicted transcripts, and processing applied to evaluate MD performance given a reading text prompt. Predicted transcripts are processed in terms of predicted miscue event tokens and 'correct' tokens. Note that '<correct>' is shown as '<C>' for readability and incorrectly predicted miscue event tokens are highlighted in red. F1 is computed using the predicted and ground truth miscue event tokens, with no_tag inserted where needed for alignment.
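The scoring step described in the Figure 3 caption can be sketched as follows, under stated assumptions. The sketch takes two tag sequences that have already been aligned to equal length, with `no_tag` as padding, and computes a micro-averaged F1. The averaging scheme, the function name `micro_f1`, and the example tags `<sub>`, `<ins>`, and `<del>` are assumptions for illustration; the caption does not specify them.

```python
def micro_f1(gold, pred, pad="no_tag"):
    """Micro-averaged F1 over two aligned tag sequences.

    `pad` marks positions inserted only for alignment. Whether the paper
    micro- or macro-averages is not stated in the caption; micro is assumed.
    """
    if len(gold) != len(pred):
        raise ValueError("sequences must be aligned to the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g != pad:
                tp += 1  # same real tag on both sides
            continue
        if p != pad:
            fp += 1  # predicted a tag that does not match the ground truth
        if g != pad:
            fn += 1  # ground-truth tag was missed or mislabeled

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["<C>", "<sub>", "no_tag", "<del>"]  # hypothetical tags
    pred = ["<C>", "<C>", "<ins>", "<del>"]
    print(round(micro_f1(gold, pred), 4))
```

In this example there are two matches, two false positives, and one false negative, so precision is 1/2 and recall is 2/3.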