Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

Michał Junczyk

TL;DR

The paper tackles the lack of standardized Polish ASR benchmarks, which stems from the poor discoverability and interoperability of public speech datasets. It introduces a three-part framework (surveying datasets, curating a Polish benchmark, BIGOS, from 24 openly available datasets, and evaluating ASR systems) to enable reproducible, scalable benchmarking. An evaluation of 7 systems and 25 models on the BIGOS and PELCRA datasets reveals clear performance differences driven by system type, model size, and speech style; the results are published through dashboards and open-source tools. The framework improves reproducibility, encourages data sharing, and can be extended to other languages, offering a concrete path toward more robust and comparable ASR evaluation for low-resource or underrepresented languages.

Abstract

Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard) along with curated datasets (https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2, https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos) and the open challenge call (https://poleval.pl/tasks/task3). Tools used for evaluation are open-sourced (https://github.com/goodmike31/pl-asr-bigos-tools), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.
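
The system comparisons summarized above are reported in terms of word error rate (WER), the metric named in the caption of Figure 3. As a point of reference, the following is a minimal sketch of the standard WER definition: word-level edit distance divided by the number of reference words. It is not the paper's evaluation tooling; the function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization, which a real evaluation pipeline would likely handle differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared verbatim
    (no casing or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative call on a made-up Polish sentence pair (one substituted word).
score = word_error_rate("ala ma kota", "ala ma psa")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.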

Paper Structure (44 sections, 7 figures, 19 tables)

Figures (7)

  • Figure 1: Architecture of data curation and ASR evaluation framework.
  • Figure 2: ASR evaluation process data flow.
  • Figure 3: Box plot of WER for systems evaluated on the BIGOS dataset.
  • Figure 4: Example evaluation results available on the Polish ASR quality dashboard.
  • Figure 5: ASR systems accuracy across speaker age groups.
  • ...and 2 more figures