Fair Representation in Parliamentary Summaries: Measuring and Mitigating Inclusion Bias

Eoghan Cunningham, James Cross, Derek Greene

Abstract

The use of large language models (LLMs) to summarise parliamentary proceedings presents a promising means of increasing the accessibility of democratic participation. However, as these systems increasingly mediate access to political information -- filtering and framing content before it reaches users -- there are important fairness considerations to address. In this work, we evaluate five LLMs (both proprietary and open-weight) in the summarisation of plenary debates from the European Parliament to investigate the representational biases that emerge in this context. We develop an attribution-aware evaluation framework to measure speaker-level inclusion and misrepresentation in debate summaries. Across all models and experiments, we find that speakers are less accurately represented in the final summary on the basis of (i) their speaking order (speeches in the middle of the debate were systematically excluded), (ii) language spoken (non-English speakers were less faithfully represented), and (iii) political affiliation (better outcomes for left-of-centre parties). We further show how biases in these contexts can be decomposed to distinguish inclusion bias (systematic omission) from hallucination bias (systematic misrepresentation), and explore the effect of different mitigation strategies. Prompting strategies do not affect these biases. Instead, we propose a hierarchical summarisation method that decomposes the task into simpler extraction and aggregation steps, which we show significantly reduces the positional (speaking-order) bias across all models. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for multilingual democratic applications.

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Debate summarisation workflows. Flat debate summaries concatenate all interventions into a single document and summarise it. Hierarchical summaries use a multi-step process in which each intervention in the debate is summarised independently before aggregation. In the 2-level hierarchical approach, we extract information from the intermediate intervention summaries and aggregate it thematically (i.e., "summarise all arguments/issues/proposals") prior to the final aggregation.
  • Figure 2: Proposed evaluation workflow. The reconstructor function recreates a structured version of a speaker's intervention using the final debate summary as input. If the speaker's contribution can be accurately reconstructed, we conclude the summary attends to that speaker and their contributions accurately.
  • Figure 3: Speaker Order Bias. BERTScore measures the similarity between a speaker's reconstructed intervention summary $\hat{s}$ (based on the information in the debate summary) and their original intervention $i$. Relative Speaker Order ($\frac{k}{n}$) represents the temporal position of their intervention in the debate ($k$) adjusted for the number of interventions in the debate ($n$), with $\frac{k}{n}\approx0$ representing the earliest speakers in the debate and $\frac{k}{n}=1$ representing the last intervention.
  • Figure 4: Speaking Order Bias -- Mitigating the lost-in-the-middle problem. BERTScore measures the similarity between a speaker's reconstructed intervention summary $\hat{s}$ (based on the information in the debate summary) and their original intervention. Relative Speaker Order represents the temporal position of their intervention in the debate, with lower values representing earlier contributions.
  • Figure 5: European Parliament Languages. Our categorisation of language resource levels is based on the number of Wikipedia articles available in each language. We find broad agreement between the prevalence of each language on Wikipedia and its prevalence in our EP speech sample.
  • ...and 4 more figures
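
The flat and hierarchical workflows described in the Figure 1 caption, and the relative speaker order $\frac{k}{n}$ from the Figure 3 caption, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Summariser` type, the function names, and the toy first-sentence summariser are all hypothetical stand-ins for an LLM call, and the 2-level thematic aggregation step is omitted.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes text, returns a summary.
Summariser = Callable[[str], str]


def flat_summary(interventions: List[str], summarise: Summariser) -> str:
    """Concatenate all interventions into one document and summarise it once."""
    return summarise("\n\n".join(interventions))


def hierarchical_summary(interventions: List[str], summarise: Summariser) -> str:
    """Summarise each intervention independently, then aggregate the results."""
    per_intervention = [summarise(text) for text in interventions]
    return summarise("\n\n".join(per_intervention))


def relative_speaker_order(n: int) -> List[float]:
    """Return k/n for k = 1..n, so the last intervention maps to exactly 1."""
    return [k / n for k in range(1, n + 1)]


def first_sentence(text: str) -> str:
    """Toy summariser (an assumption for this sketch): keep the first sentence."""
    return text.split(".")[0] + "."


if __name__ == "__main__":
    debate = [
        "Speaker A supports the proposal. Further detail follows.",
        "Speaker B raises budget concerns. Further detail follows.",
        "Speaker C asks for a delay. Further detail follows.",
    ]
    print(flat_summary(debate, first_sentence))
    print(hierarchical_summary(debate, first_sentence))
    print(relative_speaker_order(len(debate)))
```

In the flat variant the summariser sees the whole debate in a single call, whereas the hierarchical variant makes one call per intervention plus one aggregation call, so every speaker is processed in isolation before the final summary is produced.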