Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei
TL;DR
This work tackles the fairness and scalability challenges of using LLMs to summarize large-scale public deliberations. It introduces DeliberationBank, a large, human-grounded dataset of opinions and summary judgments, and DeliberationJudge, a fine-tuned DeBERTa-based evaluator that assesses summaries along four dimensions: representativeness, informativeness, neutrality, and policy approval. Benchmarking 18 LLMs, the study finds persistent underrepresentation of minority positions and weak alignment between generic LLM judges and human judgments, while showing that DeliberationJudge is both more efficient and better aligned with human ratings. The authors propose broader, diversity-aware evaluation and reward-modeling directions to promote more representative and equitable deliberation systems for policymaking.
Abstract
Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have shown promise as a tool for summarizing large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires comprehensive evaluation at scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and better aligned with human judgments than a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.
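To make the idea of rating a summary "from individual perspectives" concrete, the sketch below shows one simple way per-participant representativeness scores could be compared across opinion groups. It is an illustration only, not the authors' metric: the function name `representation_gap`, the group labels, and the toy scores are all assumptions introduced here, and the abstract does not state how minority underrepresentation is quantified.

```python
from statistics import mean


def representation_gap(scores, groups):
    """Mean score of majority-position holders minus mean score of
    minority-position holders (hypothetical measure, for illustration).

    scores: one representativeness score per participant, i.e. how well
            the summary reflects that participant's own opinion.
    groups: parallel list of "majority" / "minority" labels.
    """
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    return mean(by_group["majority"]) - mean(by_group["minority"])


# Toy values, not data from the study.
scores = [0.8, 0.7, 0.9, 0.4, 0.3]
groups = ["majority", "majority", "majority", "minority", "minority"]
gap = representation_gap(scores, groups)
print(f"representation gap: {gap:.2f}")
```

A positive gap under this toy definition would mean the summary reflects majority-position participants better than minority-position ones; a gap near zero would suggest more even coverage.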
