MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
TL;DR
MeetEval addresses the evaluation of meeting transcription systems by providing a unified toolkit that implements multiple WER definitions (cpWER, ORC-WER, MIMO-WER) and a time-constrained variant that makes alignments more plausible. When exact word-level timings are unavailable, it approximates them with pseudo-word-level timings derived from segment-level timings, which keeps the matching accurate and reduces its runtime. The toolkit guides practitioners on metric selection for diarization, continuous speech separation (CSS), and serialized output training (SOT) style outputs, and demonstrates that the time-constrained approach yields more realistic WERs while enabling runtime pruning. Overall, MeetEval supports reproducible, comparable benchmarking across diverse meeting transcription systems and facilitates rigorous analysis of transcription errors.
Abstract
MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC-WER and MIMO-WER, along with other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are identified as correct only when the temporal alignment is plausible. This yields a higher-quality matching of the hypothesis string to the reference string, one that more closely reflects the actual transcription quality, and it penalizes a system that provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a WER similar to that obtained by matching with exact word-level annotations. At the same time, the time constraint speeds up the matching algorithm, and this speedup outweighs the additional overhead caused by processing the time stamps.
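To make the two ideas in the abstract concrete, the following is a minimal sketch and not MeetEval's actual implementation or API. It assumes one plausible approximation strategy, in which a segment's time span is divided among its words in proportion to their character counts, and it assumes that the temporal constraint means two words may be matched (as correct or as a substitution) only if their time intervals overlap, optionally widened by a collar. The names `pseudo_word_timings`, `overlaps`, and `time_constrained_distance` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# A timed word: (token, start_time, end_time), times in seconds.
Word = Tuple[str, float, float]


def pseudo_word_timings(segment: str, start: float, end: float) -> List[Word]:
    """Approximate word-level timings from a segment-level interval.

    Assumption: each word receives a share of the segment duration
    proportional to its number of characters.
    """
    words = segment.split()
    if not words:
        return []
    total_chars = sum(len(w) for w in words)
    duration = end - start
    timed: List[Word] = []
    cursor = start
    for w in words:
        w_end = cursor + duration * len(w) / total_chars
        timed.append((w, cursor, w_end))
        cursor = w_end
    return timed


def overlaps(a: Word, b: Word, collar: float = 0.0) -> bool:
    """True if the two intervals overlap once widened by `collar` seconds."""
    return a[1] < b[2] + collar and b[1] < a[2] + collar


def time_constrained_distance(ref: List[Word], hyp: List[Word],
                              collar: float = 0.0) -> int:
    """Levenshtein distance in which a match or substitution is allowed
    only between words whose (pseudo-)timings overlap.

    Words that do not overlap can only be handled by an insertion plus
    a deletion, so implausible alignments cost more.
    """
    prev = list(range(len(hyp) + 1))  # row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            best = min(prev[j] + 1,      # deletion of reference word
                       cur[j - 1] + 1)   # insertion of hypothesis word
            if overlaps(r, h, collar):
                # correct (cost 0) or substitution (cost 1)
                best = min(best, prev[j - 1] + (r[0] != h[0]))
            cur.append(best)
        prev = cur
    return prev[-1]
```

Under these assumptions, a hypothesis with the right words but badly shifted time stamps is counted as insertions plus deletions rather than as correct words. Because non-overlapping word pairs can never be matched, an implementation could also skip those cells of the dynamic-programming table; this sketch does not do that, and only illustrates where such pruning would apply.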
