The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Nikola Ljubešić, Peter Rupnik, Danijel Koržinek

TL;DR

The paper tackles the scarcity of aligned speech-text data for less-resourced languages by building large-scale, open datasets from parliamentary proceedings. It introduces a scalable pipeline that combines modern end-to-end ASR, text normalization, and a Levenshtein-based matching strategy to align long audio recordings with transcripts, even when data are incomplete or out of order. The authors generate over 5,000 hours of aligned Croatian, Polish, and Serbian data and release them in three complementary formats (master FAIR, HuggingFace, and concordancer) to support ASR development and linguistic research. This work demonstrates a practical pathway to extend high-quality speech resources to many languages and lays groundwork for expanding ParlaSpeech to additional jurisdictions and languages. The approach enables broader access to spoken-language resources, facilitating both technological advancement and linguistic analysis while highlighting considerations around data bias and alignment quality.

Abstract

Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data and from various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point is the ParlaMint comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages, namely Croatian, Polish, and Serbian. The main challenge of our approach is the lack of any global alignment between the ParlaMint texts and the available recordings, as well as the sometimes varying data order in each of the modalities, which requires a novel approach to aligning long sequences of text and audio in a large search space. The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts. Although these datasets already make a huge difference in the availability of spoken and textual data for the three languages, we want to emphasize the potential of the presented approach for building similar datasets for many more languages.

Paper Structure (22 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Diagram of the long speech-to-text sequence alignment pipeline for processing a single audio-text file pair. Circles are intermediary data structures. Rectangles are processes.
  • Figure 2: Illustration of the matching algorithm. The purpose is to find portions of the reference that match the ASR output. Neither sequence nor accuracy is guaranteed.
  • Figure 3: Example of a search result on the noun "tehnologija" with a preceding adjective in the concordancer of the Croatian corpus. The recording can be accessed to the right, the metadata to the left.
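
The matching idea in Figure 2, locating the portion of a reference transcript that best corresponds to an ASR output, can be illustrated with a word-level approximate substring search based on Levenshtein distance. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `best_match` are hypothetical, the normalization is a simplistic stand-in for real text normalization, and the example sentences are invented.

```python
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, replace non-alphanumeric characters with spaces, split into words.

    Assumption: a toy stand-in for the text normalization step.
    """
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def best_match(asr: List[str], ref: List[str]) -> Tuple[int, int, int]:
    """Find the span ref[start:end] closest to `asr` in word-level edit distance.

    Returns (start, end, distance). The span may begin anywhere in `ref`,
    so reference words before and after the span cost nothing.
    """
    m = len(asr)
    # prev[i] = (cost, span_start) for aligning asr[:i] with a span ending at ref position j.
    # At j = 0 the span is empty, so asr[:i] costs i edits.
    prev = [(i, 0) for i in range(m + 1)]
    best = (prev[m][0], 0, 0)  # (distance, start, end)
    for j in range(1, len(ref) + 1):
        # An empty ASR prefix matches an empty span starting at j for free.
        cur = [(0, j)]
        for i in range(1, m + 1):
            sub_cost = 0 if asr[i - 1] == ref[j - 1] else 1
            cur.append(min(
                (prev[i - 1][0] + sub_cost, prev[i - 1][1]),  # match or substitution
                (prev[i][0] + 1, prev[i][1]),                 # reference word missing from ASR
                (cur[i - 1][0] + 1, cur[i - 1][1]),           # ASR word missing from reference
            ))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    distance, start, end = best
    return start, end, distance


# Invented example: the ASR output contains one recognition error ("opened").
ref = normalize("Good morning. The session is now open, and we begin with the budget.")
asr = normalize("the session is now opened and we begin")
start, end, distance = best_match(asr, ref)
print(" ".join(ref[start:end]), distance)
```

The search runs in time proportional to the product of the two sequence lengths, so a setup with long recordings and a large search space would likely need additional pruning or indexing, which this sketch does not attempt.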