Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

Michał Junczyk

TL;DR

The paper tackles the lack of standardized Polish ASR benchmarks, which stems from the poor discoverability and interoperability of existing speech datasets. It introduces a three-part framework that covers surveying datasets, curating a Polish benchmark (BIGOS) from 24 openly available datasets, and evaluating ASR systems, with the aim of reproducible, scalable benchmarking. An evaluation of 7 systems and 25 models on BIGOS and PELCRA reveals clear performance differences driven by system type, model size, and speech style; the results are published via dashboards and open-source tools. The framework improves reproducibility, encourages data sharing, and can be extended to other languages, offering a concrete path toward more robust and comparable ASR evaluation for low-resource or underrepresented languages.

Abstract

Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard) along with curated datasets (https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2, https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos) and the open challenge call (https://poleval.pl/tasks/task3). Tools used for evaluation are open-sourced (https://github.com/goodmike31/pl-asr-bigos-tools), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.
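
The figure of 600 evaluations is consistent with a full cross product of 25 system-model combinations and 24 test sets (25 × 24 = 600). The Python sketch below enumerates such a grid under that reading. All identifiers are hypothetical placeholders, and the assumption of exactly one test set per curated dataset is made here for illustration; the abstract does not state it.

```python
from itertools import product

# Hypothetical placeholder identifiers: the actual system, model, and
# dataset names are not listed in the abstract.
system_model_combos = [f"system_model_{i:02d}" for i in range(1, 26)]  # 25 combinations

# Assumption: one test set per curated dataset, 24 in total.
test_sets = [f"test_set_{j:02d}" for j in range(1, 25)]

# Each (system-model combination, test set) pair is one evaluation run.
evaluation_grid = list(product(system_model_combos, test_sets))

print(len(evaluation_grid))
```

Under these assumptions the grid size matches the number of evaluations reported in the abstract; a different split of test sets per dataset would change the count.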
