What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations

Kavya Manohar; Leena G Pillai; Elizabeth Sherly

TL;DR

This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts, and proposes a shift towards developing text normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.

Abstract

This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routines employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer, and their unintended consequences for performance metrics. Current text normalization practices aim to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, but our research reveals that they are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. We conclude by proposing a shift towards developing text normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
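To make the claim about artificially improved metrics concrete, the following is a minimal sketch of how a normalization step can hide a genuine recognition error in an Indic script. It assumes a hypothetical normalizer, `strip_marks`, that replaces every Unicode combining mark with a space; this is an illustrative stand-in and not the exact routine of any model named above. The helper `word_error_rate` and the Hindi word pair are likewise chosen only for illustration.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical normalizer: replace every Unicode mark (Mn, Mc, Me) with a space."""
    replaced = "".join(
        " " if unicodedata.category(ch).startswith("M") else ch
        for ch in text
    )
    # Collapse the whitespace introduced by the replacements.
    return " ".join(replaced.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Two distinct Hindi words that differ only in their vowel sign.
reference = "\u0915\u0940"   # की (KA + VOWEL SIGN II)
hypothesis = "\u0915\u093f"  # कि (KA + VOWEL SIGN I)

raw_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(strip_marks(reference), strip_marks(hypothesis))
print(raw_wer, normalized_wer)  # prints: 1.0 0.0
```

Both vowel signs belong to the Unicode category Mc (spacing combining mark), so under this assumed normalizer the two words collapse to the same bare consonant and the substitution error disappears from the score.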

Paper Structure (9 sections, 1 figure, 1 table)


Introduction
Background and Related Works
Methodology
Analysis of Text Similarity after Whisper Normalization
Impact of Whisper Normalization on WER
Recommendations
Conclusions
Limitations
Resources

Figures (1)

Figure 1: Performance comparison of the OpenAI Whisper-Small model across different languages. The left graph shows WER for the original model, and the right graph shows WER after language-specific fine-tuning. Regular WER is computed on raw transcripts; normalized WER is computed on Whisper-normalized transcripts.
