Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces

Korbinian Kuhn, Verena Kersken, Gottfried Zimmermann

TL;DR

This study investigates whether word-level confidence scores from end-to-end ASR systems can reliably detect transcription errors and support user-led correction. It combines a large-scale, multi-model ASR evaluation with a German user study of three error-highlighting interfaces. Transcript-level confidence correlates consistently with overall accuracy, but word-level discrimination is weak, so error detection performance is modest and unstable (precision ~0.41–0.55, recall ~0.36–0.64, AUC ~0.68–0.87). User corrections did not improve with confidence-based highlighting, and participants largely preferred minimal or balanced marking. Current confidence scores therefore offer limited practical utility, and more robust, explainable approaches are needed. The work informs HCI researchers about the constraints of confidence-based interfaces and points toward richer uncertainty representations for ASR-assisted editing.

Abstract

Despite advances in Automatic Speech Recognition (ASR), transcription errors persist and require manual correction. Confidence scores, which indicate the certainty of ASR results, could assist users in identifying and correcting errors. This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. The results show that while confidence scores correlate with transcription accuracy, their error detection performance is limited. Classifiers frequently miss errors or generate many false positives, undermining their practical utility. Confidence-based error detection neither improved correction efficiency nor was perceived as helpful by participants. These findings highlight the limitations of confidence scores and the need for more sophisticated approaches to improve user interaction and explainability of ASR results.
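To make the trade-off between missed errors and false positives concrete, the following is a minimal sketch of threshold-based error detection on word-level confidence scores. It is not the authors' implementation: the function names, the thresholds, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def flag_errors(confidences: List[float], threshold: float) -> List[bool]:
    """Flag a word as a suspected error when its confidence is below the threshold."""
    return [c < threshold for c in confidences]


def precision_recall(flags: List[bool], is_error: List[bool]) -> Tuple[float, float]:
    """Compare flagged words against ground-truth error labels."""
    tp = sum(f and e for f, e in zip(flags, is_error))        # errors correctly flagged
    fp = sum(f and not e for f, e in zip(flags, is_error))    # correct words flagged
    fn = sum((not f) and e for f, e in zip(flags, is_error))  # errors not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up toy data (assumption): one confidence score and one error label per word.
confidences = [0.98, 0.45, 0.91, 0.60, 0.30, 0.88, 0.72, 0.95]
is_error = [False, True, False, False, True, True, False, False]

# Hypothetical thresholds: a strict one and a lenient one.
for threshold in (0.5, 0.9):
    flags = flag_errors(confidences, threshold)
    p, r = precision_recall(flags, is_error)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy data, the lower threshold flags only true errors but misses the error that was recognized with high confidence, while the higher threshold catches every error at the cost of flagging correct words. This is the same tension the abstract describes, shown here on invented numbers rather than on the study's results.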

Paper Structure

This paper contains 30 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Correction interface of the user study.