MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

TL;DR

MeetEval addresses the evaluation of meeting transcription systems by providing a unified toolkit that implements multiple WER definitions (cpWER, ORC-WER, MIMO-WER) and a time-constrained variant to improve alignment plausibility. It introduces pseudo-word-level timing approximations when exact word-level timings are unavailable, enabling scalable, accurate matching and practical runtime benefits. The toolkit guides practitioners on metric selection for diarization, CSS, and SOT-style outputs, and demonstrates that the time-constrained approach yields more realistic WERs while enabling runtime pruning. Overall, MeetEval supports reproducible, comparable benchmarking across diverse meeting transcription systems and facilitates rigorous analysis of transcription errors.

Abstract

MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC-WER and MIMO-WER, along with other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are identified as correct only when the temporal alignment is plausible. This improves the quality of the matching between the hypothesis string and the reference string, so that the resulting WER more closely reflects the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a WER similar to that obtained by matching with exact word-level annotations. At the same time, the time constraint speeds up the matching algorithm, and this speedup outweighs the additional overhead caused by processing the time stamps.
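
To make the idea of pseudo-word-level timings and a temporal constraint more concrete, the following is a minimal sketch, not MeetEval's actual implementation or API. It assumes that a character-based strategy assigns each word a share of the segment duration proportional to its number of characters, and that the collar is applied by widening the hypothesis interval on both sides before checking for overlap with the reference interval. The function names `char_based_word_timings` and `may_match` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (word, start time in seconds, end time in seconds)
TimedWord = Tuple[str, float, float]


def char_based_word_timings(
    words: List[str], start: float, end: float
) -> List[TimedWord]:
    """Approximate word-level timings from a segment-level interval.

    Assumption (illustrative only): each word receives a share of the
    segment duration proportional to its character count.
    """
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    duration = end - start
    timings: List[TimedWord] = []
    cursor = start
    for w in words:
        word_end = cursor + duration * len(w) / total_chars
        timings.append((w, cursor, word_end))
        cursor = word_end
    return timings


def may_match(ref: TimedWord, hyp: TimedWord, collar: float) -> bool:
    """Return True if ref and hyp may be aligned under the time constraint.

    Assumption (illustrative only): the hypothesis interval is widened by
    `collar` seconds on both sides and must then overlap the reference.
    """
    _, r_start, r_end = ref
    _, h_start, h_end = hyp
    return h_start - collar < r_end and r_start < h_end + collar


if __name__ == "__main__":
    segment = char_based_word_timings(["hello", "world"], 0.0, 2.0)
    print(segment)
    print(may_match(("a", 0.0, 1.0), ("a", 1.5, 2.0), collar=1.0))
    print(may_match(("a", 0.0, 1.0), ("a", 5.0, 6.0), collar=1.0))
```

A check of this kind also suggests why the constraint can reduce runtime: word pairs that fail it never need to be considered by the alignment, which shrinks the search space.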

Paper Structure (19 sections, 5 equations, 5 figures, 1 table)

This paper contains 19 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: MeetEval is tailored to the meeting transcription scenario. The input to a speech recognizer contains speech of multiple speakers with potential overlap. The output formats of the different recognizer styles are visualized using a simple example. Letters represent words, words represent segments/utterances and colors represent speakers.
  • Figure 2: Visualization of the different pseudo-word-level annotation strategies. The collar is visualized as gray boxes and kept short for better visualization. The character-based annotation strategy correlates best with the actual pronunciation time.
  • Figure 3: Density plot of the gap sizes between pseudo-word-level annotations and ground-truth word-level annotations of each word for TIMIT and LibriSpeech.
  • Figure 4: Top: tcpWER over collar for the TS-SEP model on Libri-CSS. The "desired WER" is determined with oracle word-level timestamps. "+ $n$ s pauses" means that segments were artificially merged so that they include silences of $n$ seconds in length. Bottom: Proportion of matchings in the desired WER that would be disallowed by the collar. This value should be 0.
  • Figure 5: Execution time of cpWER and tcpWER on different datasets. The number of recordings in each dataset and the average recording length are given in parentheses.