Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric

Golara Javadi; Kamer Ali Yuksel; Yunsu Kim; Thiago Castro Ferreira; Mohamed Al-Badrashiny

TL;DR

This work tackles the lack of transparency in ASR by presenting NoRefER, a reference-free quality estimation metric whose scaled attention signals are used to identify word-level transcription errors. It combines a compact cross-lingual transformer (MiniLMv2) with attention scaling by the $L_2$ norm of the value vectors and aggregation of token scores to the word level, which allows likely errors to be ranked without ground-truth transcripts. The attention-based scores correlate strongly with actual errors on LibriSpeech and Common Voice across multiple languages, often outperforming confidence-based baselines. They support post-editing prioritization and corpus building, and because only the hypothesis text is needed, the approach also applies to commercial black-box models. The source code is publicly released to support reproducibility and practical adoption for ASR explainability, data efficiency, and targeted model improvements.
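The scaling and aggregation steps summarized above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the array shapes, the choice to sum attention over query positions, the mean over heads, and the mean over subword tokens are all assumptions made for illustration, and the function names `token_scores`, `word_scores`, and `rank_words` are hypothetical. Only the scaling by the $L_2$ norm of the value vectors, the averaging across layers, and the token-to-word aggregation come from the summary and the Figure 1 caption.

```python
import numpy as np


def token_scores(attn, values):
    """Score each token by norm-scaled attention.

    attn:   (layers, heads, T, T) attention weights, rows = queries, cols = keys.
    values: (layers, heads, T, d) value vectors.
    Returns a (T,) array with one score per token.
    """
    # L2 norm of every value vector: shape (layers, heads, T).
    value_norms = np.linalg.norm(values, axis=-1)
    # Scale each attention weight by the norm of the value it attends to.
    scaled = attn * value_norms[:, :, None, :]
    # Assumption: a token's score is the scaled attention it receives,
    # summed over all query positions.
    received = scaled.sum(axis=2)
    # Average across layers and (assumption) across heads.
    return received.mean(axis=(0, 1))


def word_scores(tok_scores, word_ids):
    """Aggregate token scores to words (assumption: mean over subword tokens).

    word_ids[i] is the index of the word that token i belongs to.
    """
    n_words = max(word_ids) + 1
    sums = np.zeros(n_words)
    counts = np.zeros(n_words)
    for score, w in zip(tok_scores, word_ids):
        sums[w] += score
        counts[w] += 1
    return sums / counts


def rank_words(words, scores):
    """Return words ordered from highest to lowest score."""
    order = np.argsort(-np.asarray(scores))
    return [words[i] for i in order]


if __name__ == "__main__":
    # Toy input: 1 layer, 1 head, 3 subword tokens forming 2 words.
    attn = np.full((1, 1, 3, 3), 1.0 / 3.0)
    values = np.array([[[[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]]])
    tok = token_scores(attn, values)
    ws = word_scores(tok, [0, 1, 1])
    print(rank_words(["the", "cat"], ws))
```

In this sketch the ranking step only orders words by score; deciding how many top-ranked words to flag for a post-editor is a separate choice that the sketch leaves open.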

Abstract

In automatic speech recognition (ASR), models need not only to be accurate but also to offer transparency in their decision-making. This work introduces and evaluates quality estimation (QE) metrics as a novel tool for enhancing explainable artificial intelligence (XAI) in ASR systems. Through experiments and analyses, the capabilities of the NoRefER (No Reference Error Rate) metric are explored in identifying word-level errors to aid post-editors in refining ASR hypotheses. The investigation also extends to the utility of NoRefER in the corpus-building process, demonstrating its effectiveness in augmenting datasets with insightful annotations. The diagnostic aspects of NoRefER are examined, revealing its ability to provide valuable insights into model behaviors and decision patterns. This has proven beneficial for prioritizing hypotheses in post-editing workflows and for fine-tuning ASR models. The findings suggest that NoRefER is not merely a tool for error detection but also a comprehensive framework for enhancing the transparency, efficiency, and effectiveness of ASR systems. To ensure the reproducibility of the results, all source code for this study is made publicly available.

Paper Structure (7 sections, 2 figures, 4 tables)

Introduction
Related works
Methodology
Dataset and metrics
Experiments
Discussion
Conclusions

Figures (2)

Figure 1: NoRefER attention for error identification at the token and word levels. Attention scores (with $L_2$ norm scaling) are averaged across all model layers for interpretability.
Figure 2: Variation of Weighted Precision, Recall, and F1-Score with increasing $k$ in the evaluation of word-level attention.
