Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei

TL;DR

This work tackles the fairness and scalability challenges of using LLMs to summarize large-scale public deliberations. It introduces DeliberationBank, a large, human-grounded dataset of opinions and summary judgments, and DeliberationJudge, a DeBERTa-based evaluator fine-tuned to rate summaries on representativeness, informativeness, neutrality, and policy approval. Benchmarking 18 LLMs, the study finds persistent underrepresentation of minority positions and weak alignment between generic LLM judges and human judgments, while DeliberationJudge aligns more closely with human ratings and runs far more efficiently. The authors propose broader, diversity-aware evaluation and reward-modeling directions to promote more representative and equitable deliberation systems for policymaking.

Abstract

Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have shown promise as a tool for summarizing large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires comprehensive evaluation at scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and better aligned with human judgments than a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

Paper Structure

This paper contains 49 sections, 14 equations, 33 figures, 12 tables, 1 algorithm.

Figures (33)

  • Figure 1: Overview of our benchmark framework, including opinion collection, judge model training, and LLM-based deliberation summarization evaluation.
  • Figure 2: DeliberationBank creation pipeline
  • Figure 3: Heatmap of consistency between human and LLM judges on two judgment tasks. Left: Rating–Pearson $r$; Middle: Rating–Spearman $\rho$; Right: Comparison–Spearman $\rho$. Lighter to darker colors represent lower to higher correlations.
  • Figure 4: Left: Judging Time vs. Spearman Correlation. DeliberationJudge achieves both the lowest time and the highest correlation compared to LLMs; Right: Scaling Stability. Total judging time of DeliberationJudge and eight LLMs as the number of comments increases; solid lines show means and shaded areas denote min–max ranges. In the figures, DelibJudge=DeliberationJudge.
  • Figure 5: Comparative performance on Representativeness, Informativeness, Neutrality, and Policy Approval (mean $\pm$ 95% CI); higher is better.
  • ...and 28 more figures
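
Figure 3 reports agreement between human and LLM judges as Pearson $r$ and Spearman $\rho$. As a rough illustration of what those two statistics measure, the sketch below computes both for a pair of rating lists using only the Python standard library. The function names and the example ratings are hypothetical and are not taken from the paper or from DeliberationBank; the paper's own evaluation code may differ.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson r: strength of the linear relationship between x and y."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five summaries (illustrative only).
human_ratings = [1, 2, 3, 4, 5]
judge_ratings = [1, 4, 9, 16, 25]  # same ordering, nonlinear scale

r = pearson(human_ratings, judge_ratings)
rho = spearman(human_ratings, judge_ratings)
```

In this toy case the judge orders the summaries exactly as the humans do, so the Spearman value is 1 even though the Pearson value falls below 1 because the relationship is not linear. That difference is one reason to report both statistics when comparing judges.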