ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek

TL;DR

ParlaSpeech 3.0 is a collection of richly annotated spoken parliamentary corpora for four under-resourced Slavic languages (Croatian, Czech, Polish, and Serbian), totaling about 6,000 hours. It builds on ParlaMint transcripts aligned to the parliamentary recordings and adds automatic annotation layers. All four corpora carry linguistic annotation (UD) and sentiment predictions on the text side, and filled pauses, the most frequent disfluency, on the speech side. Two of the four, Croatian and Serbian, additionally carry word- and grapheme-level alignments and the position of primary stress in multisyllabic words. The corpora are available in JSONL and TextGrid formats and through a concordancer, and their usefulness is demonstrated with an analysis of acoustic correlates of sentiment across all four languages. Methodologically, the release combines transformer-based sentiment modeling (ParlaSent), MFA-based alignment, and automatic annotation of filled pauses and stress. The result is a reusable resource for phonetic, syntactic, and sociolinguistic research, and a foundation for adding languages and annotation layers in future work.

Abstract

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - altogether 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.

Paper Structure

This paper contains 19 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Relative distribution of the number of speakers by year-of-birth and gender across all four parliaments (top), and relative distribution of the number of words spoken by speakers across age, gender, and parliaments (bottom).
  • Figure 2: Visualization of the $\mathrm{P}\left(\mathrm{Neg}>\mathrm{Pos}\right)$ effect size on three strong sentiment predictors -- pitch (F0), intensity (Int) and speech rate (SR) -- across the four languages. Speaker average and instance results are shown. Statistically non-significant results (instance-level speech rate in Czech and Polish) are omitted from the plot.
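
The $\mathrm{P}\left(\mathrm{Neg}>\mathrm{Pos}\right)$ effect size in Figure 2 can be read as a probability of superiority: the chance that a randomly drawn negative-sentiment value (for example, mean pitch) exceeds a randomly drawn positive-sentiment value. The sketch below shows one common way to estimate such a quantity from two samples. It assumes a pairwise estimator in which ties count as half; the exact estimator used in the paper is not given here, and the function name and sample values are hypothetical.

```python
from itertools import product


def prob_superiority(neg, pos):
    """Estimate P(Neg > Pos) from two samples.

    Counts the share of (neg, pos) pairs in which the negative-sentiment
    value is larger; tied pairs contribute one half. This is an assumed,
    generic estimator, not necessarily the one used in the paper.
    """
    if not neg or not pos:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for n, p in product(neg, pos):
        if n > p:
            wins += 1.0
        elif n == p:
            wins += 0.5
    return wins / (len(neg) * len(pos))


# Hypothetical per-utterance pitch values (Hz), for illustration only.
neg_pitch = [182.0, 175.5, 190.2]
pos_pitch = [170.1, 176.0, 168.4]
effect = prob_superiority(neg_pitch, pos_pitch)
```

Under this reading, a value of 0.5 means no difference between the two sentiment classes, values above 0.5 mean the measure tends to be higher in negative speech, and values below 0.5 mean the opposite.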