Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline
Seonmin Koo, Chanjun Park, Jinsung Kim, Jaehyung Seo, Sugyeong Eo, Hyeonseok Moon, Heuiseok Lim
TL;DR
This work addresses the limited explainability of ASR evaluation by proposing the Error Explainable Benchmark (EEB), which jointly considers speech-level errors that affect recognition and text-level errors that affect readability. It introduces a dual taxonomy: at the speech level, 24 speech-noise subtypes plus 13 speaker-characteristic traits; at the text level, 13 error types adapted from Korean grammatical error correction (GEC) data. Together these enable fine-grained diagnosis for ASR and ASR post-processing (ASRP). The authors outline a four-step data construction pipeline (text verification, speech recording, background-noise synthesis, and difficulty annotation) and use consensus labeling to ensure data quality and realism. The proposed framework aims to enable real-world-centric evaluation and improved post-processing, with practical implications for reducing user dissatisfaction and guiding model improvements in ASR systems, particularly in Korean contexts.
Abstract
Automatic speech recognition (ASR) outputs serve as input for downstream tasks and substantially affect end-user satisfaction. Diagnosing and fixing the vulnerabilities of an ASR model is therefore important. However, traditional evaluation methodologies for ASR systems produce a single composite quantitative metric, which fails to provide insight into specific vulnerabilities. This lack of detail extends to the post-processing stage, further obscuring potential weaknesses. Even when an ASR model recognizes utterances accurately, poor readability can reduce user satisfaction, giving rise to a trade-off between recognition accuracy and user-friendliness. To address this effectively, it is imperative to consider both the speech level, which is crucial for recognition accuracy, and the text level, which is critical for user-friendliness. Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. By covering both the speech and text levels, this dataset enables a granular understanding of a model's shortcomings. Our proposal provides a structured pathway toward a more 'real-world-centric' evaluation, a marked shift away from abstracted, traditional methods, allowing nuanced system weaknesses to be detected and rectified, with the ultimate aim of an improved user experience.
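To illustrate why a single composite score can hide specific weaknesses, the following is a minimal sketch rather than the authors' evaluation code. It assumes the composite metric is word error rate (WER), which the abstract does not name, and it uses hypothetical error-type tags (`background_noise`, `clean`) and toy English utterances that do not come from the EEB dataset.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus-level WER over (reference, hypothesis) string pairs."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0


def wer_by_tag(samples):
    """Group samples by their (hypothetical) error-type tag and score each group."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["tag"]].append((s["ref"], s["hyp"]))
    return {tag: wer(pairs) for tag, pairs in groups.items()}


# Toy data: tags and sentences are illustrative assumptions only.
samples = [
    {"tag": "background_noise", "ref": "turn on the light", "hyp": "turn on the night"},
    {"tag": "background_noise", "ref": "call my mother", "hyp": "call mother"},
    {"tag": "clean", "ref": "what time is it", "hyp": "what time is it"},
    {"tag": "clean", "ref": "play some music", "hyp": "play some music"},
]

overall = wer([(s["ref"], s["hyp"]) for s in samples])
per_tag = wer_by_tag(samples)
print(f"overall WER: {overall:.3f}")
for tag, score in per_tag.items():
    print(f"{tag}: {score:.3f}")
```

In this toy set the overall score averages over all utterances, while the per-tag breakdown shows that every error falls in the noisy group. This is the kind of diagnosis that an error-type-annotated benchmark is meant to support.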
