Strategies for political-statement segmentation and labelling in unstructured text

Dmitry Nikolaev, Sean Papay

TL;DR

The paper tackles the problem of jointly segmenting raw political text and labelling the resulting statements with MARPOR categories, which enables scalable, cross-domain analysis. It compares three unified models (a CRF on top of a multilingual encoder, fine-tuned Flan-T5, and in-context learning with Llama 3.1) on manifestos and parliamentary debates, highlighting the trade-offs between accuracy and compute. The CRF approach offers efficient, competitive performance when statement boundaries are not provided, while the fine-tuned text-to-text model yields higher in-domain accuracy at substantial compute cost; in-context learning with constrained decoding is currently less effective. The authors demonstrate the method’s applicability by applying it to UK Hansard debates, which reveals interpretable party trajectories, and emphasize the need for faster inference to scale to large corpora.

Abstract

Analysis of parliamentary speeches and political-party manifestos has become an integral area of the computational study of political texts. While speeches have been overwhelmingly analysed using unsupervised methods, a large corpus of manifestos with by-statement political-stance labels has been created by the participants of the MARPOR project. It has recently been shown that these labels can be predicted by a neural model; however, the current approach relies on provided statement boundaries, which limits its out-of-domain applicability. In this work, we propose and test a range of unified split-and-label frameworks (based on linear-chain CRFs, fine-tuned text-to-text models, and the combination of in-context learning with constrained decoding) that can be used to jointly segment and classify statements from raw textual data. We show that our approaches achieve competitive accuracy when applied to the raw text of political manifestos, and then demonstrate the research potential of our method by applying it to the records of the UK House of Commons and tracing the political trajectories of four major parties in the last three decades.

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: An example of an in-context learning prompt, comprising natural-language instructions, in-context learning examples, and the input text. The instructions are shown verbatim; in-context learning examples shown are real examples from the dataset but are truncated for space. The model's response to this prompt, decoded with constraints, will constitute the prediction for the input text.
  • Figure 2: Political trajectories of major UK parties traced by projecting yearly salience vectors of CMP categories in their parliamentary speeches using non-negative matrix factorization and the original CMP data as the training set.
  • Figure 3: Yearly speech counts of four major UK parties recorded in Hansard over the last four decades.
  • Figure 4: RILE scores for four major UK parties computed based on the House of Commons speeches by their members.
  • Figure 5: RILE scores for four major UK parties computed based on the House of Commons speeches by their members, with label 305, "Political authority", excluded from the estimation.
  • ...and 1 more figure