The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Nikola Ljubešić; Peter Rupnik; Danijel Koržinek

TL;DR

The paper tackles the scarcity of aligned speech-text data for less-resourced languages by building large-scale, open datasets from parliamentary proceedings. It introduces a scalable pipeline that combines modern end-to-end ASR, text normalization, and a Levenshtein-based matching strategy to align long audio recordings with transcripts, even when data are incomplete or out of order. The authors generate over 5,000 hours of aligned Croatian, Polish, and Serbian data and release them in three complementary formats (master FAIR, HuggingFace, and concordancer) to support ASR development and linguistic research. This work demonstrates a practical pathway to extend high-quality speech resources to many languages and lays groundwork for expanding ParlaSpeech to additional jurisdictions and languages. The approach enables broader access to spoken-language resources, facilitating both technological advancement and linguistic analysis while highlighting considerations around data bias and alignment quality.

Abstract

Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data and from various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point is the ParlaMint collection of comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages, namely Croatian, Polish, and Serbian. The main challenge of our approach is the lack of any global alignment between the ParlaMint texts and the available recordings, as well as the sometimes varying data order in each of the modalities, which requires a novel approach to aligning long sequences of text and audio in a large search space. The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts. Although these datasets already make a substantial difference in the availability of spoken and textual data for the three languages, we want to emphasize the potential of the presented approach in building similar datasets for many more languages.
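
To make the idea of Levenshtein-based matching between ASR output and a transcript more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names (`levenshtein`, `normalize`, `best_window`), the toy normalization, and the fixed-size sliding window are all assumptions made for illustration. The actual pipeline has to deal with a much larger search space and with reordered or missing data, which this sketch does not attempt.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Toy normalization (an assumption): lowercase and strip punctuation."""
    punct = ".,;:!?\"'()"
    words = (w.strip(punct).lower() for w in text.split())
    return [w for w in words if w]


def best_window(asr_words, transcript_words):
    """Slide a window of len(asr_words) over the transcript and return
    (start_index, similarity) of the best-matching window, where
    similarity = 1 - word-level edit distance / longer sequence length."""
    n = len(asr_words)
    best_start, best_sim = 0, 0.0
    last_start = max(1, len(transcript_words) - n + 1)
    for start in range(last_start):
        window = transcript_words[start:start + n]
        dist = levenshtein(asr_words, window)
        sim = 1.0 - dist / max(len(asr_words), len(window), 1)
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start, best_sim


if __name__ == "__main__":
    # Hypothetical transcript and ASR hypothesis, for illustration only.
    transcript = normalize(
        "Good morning. The session is now open, and we begin with the budget."
    )
    asr = normalize("the session is now opened and we begin")
    start, sim = best_window(asr, transcript)
    print(start, round(sim, 3))
```

In this sketch the ASR hypothesis differs from the transcript in a single word, so the best window still scores highly. A similarity threshold on such a score is one plausible way to discard poorly matching segments, though the threshold and scoring used for ParlaSpeech are not stated in this section.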
