RooseBERT: A New Deal For Political Language Modelling

Deborah Dore, Elena Cabrio, Serena Villata

TL;DR

RooseBERT targets a core challenge of political discourse analysis by adapting a transformer language model to the domain of debates and speeches. It is pre-trained on a large, multi-source English political-debate corpus under two strategies, continued pre-training (cont) and training from scratch (scr), each with cased and uncased vocabularies, and is fine-tuned on four tasks: sentiment analysis, stance detection, argument component detection and classification, and argument relation prediction and classification. The model consistently surpasses general-purpose baselines and comparable political-domain models, with the scr-uncased variant often yielding the strongest results, and demonstrates robust cross-national generalization across US, UK, EU, and UN debates. RooseBERT is publicly released. The work also highlights practical considerations such as the importance of domain-specific vocabularies and the computational trade-offs of larger models, and notes limitations including the English-only scope and the potential for misuse. Future work includes extending RooseBERT to other languages for cross-lingual political analysis and examining considerations for broader real-world deployment.

Abstract

The growing number of political debates and politics-related discussions calls for novel computational methods to automatically analyse such content, with the final goal of shedding light on political deliberation for citizens. However, the specificity of political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models. To address this issue, we introduce RooseBERT, a novel pre-trained Language Model for political discourse. Pre-training a language model on a specialised domain presents distinct technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large English political debate and speech corpora (8K debates, each composed of several sub-debates on different topics). To evaluate its performance, we fine-tuned it on four downstream tasks related to political debate analysis: stance detection, sentiment analysis, argument component detection and classification, and argument relation prediction and classification. Our results demonstrate significant improvements over general-purpose Language Models on these four tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.

Paper Structure

This paper contains 30 sections, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Cosine similarity of the pre-training datasets.