A Benchmark of French ASR Systems Based on Error Severity
Antoine Tholly, Jane Wottawa, Mickael Rouvier, Richard Dufour
TL;DR
The paper addresses a limitation of traditional ASR evaluation by introducing a human-centric error severity typology for French, moving beyond Word Error Rate to assess how interpretable a transcription remains for a reader. It defines four error categories (Lex, Gram, Cotx, and Fail), each capturing a different level of readability and contextual dependency, and applies this framework to a benchmark of 10 ASR systems using the REPERE French corpus with expert annotations. The experimental protocol combines Kaldi and SpeechBrain systems with varied SSL models and tokenizers. Kaldi with rescoring yields the best overall performance, while certain LeBenchmark configurations excel at handling the most critical Fail errors. The study shows that the proposed metric provides richer, more actionable insights for ASR benchmarking than WER alone, and it suggests perceptual validation to further align evaluations with user experience, with potential applicability to other languages.
Abstract
Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare the output with a reference transcription, such as Word Error Rate (WER), which counts word-level deviations from the reference (substitutions, deletions, and insertions), or semantic score-based metrics. However, these approaches often overlook what remains understandable to humans when they read a transcription containing errors. To address this limitation, a new evaluation is proposed that categorizes errors into four levels of severity, further divided into subtypes, based on objective linguistic criteria, contextual patterns, and the use of content words as the unit of analysis. This metric is applied to a benchmark of 10 state-of-the-art ASR systems for French, encompassing both HMM-based and end-to-end models. Our findings reveal the strengths and weaknesses of each system, identifying those that provide the most comfortable reading experience for users.
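
To make the WER baseline concrete, the following is a minimal sketch of the standard definition: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and the whitespace tokenization is a simplification of what a real scoring tool would do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # naive whitespace tokenization (assumption)
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("dort" -> "dors") and one
# deletion (the second "le") against a six-word reference.
reference = "le chat dort sur le canapé"
hypothesis = "le chat dors sur canapé"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

In this toy example both errors contribute equally to the score, even though a reader would likely recover from a near-homophone misspelling more easily than from other kinds of error. That uniform weighting is the limitation the severity-based evaluation is meant to address.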
