Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić

TL;DR

ParlaCAP presents a scalable, domain-specific approach to classifying parliamentary speeches by CAP topic. It combines an LLM teacher-student annotation workflow with multilingual transformer fine-tuning and applies it to the ParlaMint corpus of 28 European parliaments. A large language model annotates in-domain training data, and a smaller multilingual encoder is then fine-tuned on those annotations; agreement between the LLM and human annotators is comparable to agreement among humans, and the resulting model outperforms CAP classifiers trained on out-of-domain data. The ParlaCAP dataset combines topic labels, sentiment, and rich speaker and party metadata for over 8 million speeches, enabling cross-country analysis of attention, sentiment, and gender dynamics in European legislatures. Empirically, targeted data augmentation, domain-pretraining, and in-domain labeling yield robust topic classification across languages, with the final ParlaCAP model released for scalable annotation with a confidence-based Mix option. The work provides a cost-effective, scalable resource for analyzing agenda setting across multiple languages and institutional contexts, with practical implications for comparative politics and political communication.

Abstract

This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.

Paper Structure

This paper contains 21 sections, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: The inter-annotator agreement in terms of nominal Krippendorff's alpha between human annotators and the large language model (GPT-4o).
  • Figure 2: Performance of the ParlaCAP model on the English test dataset after removal of Mix instances predicted with low confidence.
  • Figure 3: Scope of the ParlaCAP dataset, showing the number of speeches in each parliamentary dataset and their temporal coverage.
  • Figure 4: Probability distribution of automatically annotated CAP labels for each parliament included in the ParlaCAP dataset.
  • Figure 5: Mean sentiment across CAP topics, with sentiment ranging from 0 (negative) to 5 (positive). Red denotes more positive sentiment and blue denotes more negative sentiment towards the topic.
  • ...and 2 more figures
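
Figure 1 reports agreement as nominal Krippendorff's alpha. For readers unfamiliar with the measure, the following is a minimal sketch of how it can be computed from per-item label lists. It is an illustration only, not the authors' evaluation code: the function name `krippendorff_alpha_nominal`, the input layout, and the toy labels are assumptions made here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    `units` is a list of items; each item is a list with one label per
    annotator, using None where an annotator did not label the item.
    (This input layout is an assumption of the sketch.)
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators contributes 1 / (m - 1), where m is
    # the number of labels the item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    # Marginal totals per label and the total number of pairable labels.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy example (invented labels): two annotators, three speeches.
toy_units = [
    ["Health", "Education"],
    ["Health", "Health"],
    ["Education", "Education"],
]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.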