Style-agnostic evaluation of ASR using multiple reference transcripts

Quinten McNamara, Miguel Ángel del Río Fernández, Nishchal Bhandari, Martin Ratajczak, Danny Chen, Corey Miller, Migüel Jetté

TL;DR

Word error rate (WER) conflates content errors with stylistic transcription differences, which biases ASR benchmarking. This paper introduces a finite-state transducer (FST)-based multireference evaluation that combines verbatim and nonverbatim human transcripts into a style-agnostic scoring framework, demonstrated on the Rev16 and Earnings-22 Sub10 long-form datasets. By tagging reference spans as V, NV, or GOLD and comparing word-level versus span-level constructions, the authors derive bounds and show that multireference WER and GOLD WER yield substantially lower error estimates than traditional single-reference WER, indicating that many reported errors reflect style rather than content. The approach highlights the importance of accounting for transcription style in benchmarking and suggests practical paths forward, including extending to three or more references and weighting schemes that balance human corrections against stylistic variation, to enable fairer cross-model comparisons in ASR.

Abstract

Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. As a result, we find that existing WER reports are likely significantly over-estimating the number of contentful errors made by state-of-the-art ASR systems. In addition, we have found our multireference method to be a useful mechanism for comparing the quality of ASR models that differ in the stylistic makeup of their training data and target task.
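
To make the idea of scoring against multiple references concrete, the following is a minimal sketch and not the authors' implementation. The paper builds FSTs; this sketch swaps in a plain dynamic program over a sequence of "slots", where each slot lists the alternative token spans offered by the references. The names `multiref_edits`, `_advance`, and `single`, the slot representation, and the example sentences are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Alt = Tuple[str, ...]   # one alternative span (possibly empty)
Slot = List[Alt]        # alternatives the references allow at this position


def _advance(costs: List[int], ref_token: str, hyp: Sequence[str]) -> List[int]:
    """One Levenshtein step: consume a single reference token."""
    new = [costs[0] + 1]  # reference token with no hypothesis word: deletion
    for j in range(1, len(costs)):
        match_or_sub = costs[j - 1] + (ref_token != hyp[j - 1])
        deletion = costs[j] + 1       # reference token left unmatched
        insertion = new[j - 1] + 1    # extra hypothesis word
        new.append(min(match_or_sub, deletion, insertion))
    return new


def multiref_edits(slots: Sequence[Slot], hyp: Sequence[str]) -> int:
    """Minimum word edits between hyp and any path through the slots.

    Assumes every slot holds at least one alternative.
    """
    costs = list(range(len(hyp) + 1))  # empty reference prefix: all insertions
    for slot in slots:
        candidates = []
        for alt in slot:
            c = costs
            for tok in alt:
                c = _advance(c, tok, hyp)
            candidates.append(c)
        # Keep, per hypothesis position, the cheapest alternative.
        costs = [min(col) for col in zip(*candidates)]
    return costs[-1]


def single(reference: str) -> List[Slot]:
    """Wrap one plain transcript as slots with a single alternative each."""
    return [[(tok,)] for tok in reference.split()]


# Invented example: a verbatim and a nonverbatim transcript of the same audio.
verbatim = "so um i i am gonna go"
nonverbatim = "so i am going to go"
hyp = "so i am gonna go".split()

# Hand-built span-level alignment of the two references (illustrative only).
slots = [
    [("so",)],
    [("um",), ()],                 # filler kept (V) or dropped (NV)
    [("i", "i"), ("i",)],          # repetition kept or collapsed
    [("am",)],
    [("gonna",), ("going", "to")],
    [("go",)],
]

edits_verbatim = multiref_edits(single(verbatim), hyp)
edits_nonverbatim = multiref_edits(single(nonverbatim), hyp)
edits_multi = multiref_edits(slots, hyp)
```

In this invented case the hypothesis differs from each single reference by two word edits but matches a path through the combined slots exactly, so every single-reference error here is a style difference rather than a content error. How to normalize the edit count into a rate (which reference length to divide by) is left open in this sketch.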

Paper Structure

This paper contains 10 sections, 4 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: An alignment between two example transcripts and the span-level multireference FST constructed from it. The three colors of boxes in the transcripts indicate the three tags discussed in the paper's tagging subsection.