The Wisdom of Partisan Crowds: Comparing Collective Intelligence in Humans and LLM-based Agents

Yun-Shiuan Chuang; Siddharth Suresh; Nikunj Harlalka; Agam Goyal; Robert Hawkins; Sijia Yang; Dhavan Shah; Junjie Hu; Timothy T. Rogers

TL;DR

This paper investigates whether the wisdom of partisan crowds (WOC) extends to groups of LLM-based agents role-playing Democrat and Republican personas. It adopts a Becker-style benchmark to quantify WOC, partisan bias, and human-likeness, using two LLMs (ChatGPT and Vicuna) and varying prompting conditions (detailed vs simple personas; with/without chain-of-thought) plus supervised fine-tuning on human data. Key findings show that LLM agents exhibit WOC-like error reduction in the absence of chain-of-thought and with detailed personas, while chain-of-thought prompts attenuate WOC but enhance human-like partisan bias; fine-tuning further improves human-like dynamics but can introduce overfitting on unseen questions. The work demonstrates both the potential and limitations of LLM-based agents as models or simulators of human collective intelligence and highlights how human data can guide the design of socially intelligent AI agents.

Abstract

Human groups are able to converge on more accurate beliefs through deliberation, even in the presence of polarization and partisan bias, a phenomenon known as the "wisdom of partisan crowds." Generative agents powered by Large Language Models (LLMs) are increasingly used to simulate human collective behavior, yet few benchmarks exist for evaluating their dynamics against the behavior of human groups. In this paper, we examine the extent to which the wisdom of partisan crowds emerges in groups of LLM-based agents that are prompted to role-play as partisan personas (e.g., Democrat or Republican). We find that they not only display human-like partisan biases, but also converge to more accurate beliefs through deliberation, as humans do. We then identify several factors that interfere with convergence, including the use of chain-of-thought prompting and a lack of detail in personas. Conversely, fine-tuning on human data appears to enhance convergence. These findings show the potential and limitations of LLM-based agents as a model of human collective intelligence.

Paper Structure (58 sections, 1 equation, 6 figures, 2 tables)

Introduction
Methods
Experimental Procedure
Formal notation
Personas and Agent Specification
Personas and agent specification
Chain-of-thought reasoning (CoT)
Fine-Tuning the LLMs with Human Data
Evaluation Metrics
Wisdom of Partisan Crowds Effect (WOC)
Partisan Bias
Human Likeness Index
Extreme Values ($\textit{Ext.\%}$)
Revision Coefficient
Results and Discussion
...and 43 more sections

Figures (6)

Figure 1: Experimental design comparing social feedback effects on LLM agents' estimations of partisan-biased factual questions. LLM agents role-playing Democrats and Republicans update their estimates after considering their peers' average responses (Becker et al., 2019).
Figure 2: Average Normalized Group Error ($\overline{\varepsilon}_t$) for (a) human crowds and (b) LLM agents (ChatGPT) across the experimental settings. Error bars indicate standard errors.
Figure 3: Normalized group mean $\eta_{p,t}$ over three rounds, averaged across 12 group experiments (red for Republicans, blue for Democrats), with error bars for standard errors. Each panel consists of four columns representing different data sets: Column 1 shows human data. Columns 2 to 4 show LLM (ChatGPT) agents' data. Column 2 depicts LLM agents role-playing detailed personas without CoT reasoning (the configuration with the highest $HLI$); Column 3 presents LLM results before fine-tuning; and Column 4 illustrates LLM results after fine-tuning. Panel (a) includes questions from the training set ($5 \leq q \leq 8$) used for fine-tuning the LLM agents, while Panel (b) displays questions from the hold-out test set ($1 \leq q \leq 4$). Question-specific WOC effects ($\Delta \varepsilon_q$) and partisan biases ($\beta_{\text{PB}_q}$, if expected) are overlaid for comparison.
Figure 4: Mechanism by which the WOC effect emerges in human crowds and in crowds of LLM agents. Panels (a) and (b) show examples where both humans and LLM agents exhibit the WOC effect through social interaction (i.e., the question-specific WOC effect $\Delta \varepsilon_q < 0$). In contrast, in panel (c), LLM agents do not converge towards the ground truth while humans do. In each panel, the line plot shows the normalized group mean $\eta_{p,t}$ trajectory over three rounds, averaged across 12 runs (red for Republicans, blue for Democrats), with error bars indicating standard errors. The $r$ in each panel denotes the revision coefficient, the correlation $r_\text{adj}$ between the adjusted initial individual error $\widetilde{e}_{i,p,r,q}$ and the adjusted estimate revisions $\widetilde{\Delta x}_{i,p,r,q}$ (see Evaluation Metrics). Similar to human crowds, the LLM agents show the WOC effect only when $r_\text{adj}>0$. The results for the full set of questions are shown in the appendix. $^{*}$: $p < .01$ (Bonferroni corrected for all questions); $^\textit{ns}$: not significant.
Figure 5: Analysis of the mechanism of LLM agents' wisdom of crowds (WOC) effect at the individual level. Panel (a) shows the questions where LLM agents exhibit the WOC effect ($\Delta \varepsilon_q < 0$). Panel (b) shows the questions where LLM agents do not show the WOC effect. Within each panel, the questions are ordered by their revision correlation $r_\text{adj}$. In each panel, the line plot shows the normalized group mean $\eta_{p,t}$ trajectory over three rounds, averaged across 12 runs (red for Republicans, blue for Democrats), with error bars indicating standard errors. The $r$ in each panel denotes the revision coefficient, the correlation $r_\text{adj}$ between the adjusted initial individual error $\widetilde{e}_{i,p,r,q}$ and the adjusted estimate revisions $\widetilde{\Delta x}_{i,p,r,q}$ (see Evaluation Metrics). The LLM agents show the WOC effect only when $r_\text{adj}>0$ (panel a). $^{*}$: $p < .01$ (Bonferroni corrected for all questions); $^\textit{ns}$: not significant.
...and 1 more figure
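
The captions for Figures 4 and 5 tie the WOC effect to a positive correlation between each agent's initial error and how much that agent revises its estimate. The following is a minimal sketch of that idea, not the paper's implementation. It assumes unadjusted absolute errors and absolute revisions, whereas the paper uses adjusted quantities ($\widetilde{e}$, $\widetilde{\Delta x}$) whose definition is not given here. It also uses an unnormalized group error as a stand-in for $\varepsilon_t$. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def group_error(estimates, truth):
    """Absolute error of the group mean (assumption: no normalization)."""
    return abs(sum(estimates) / len(estimates) - truth)


def revision_coefficient(first_round, last_round, truth):
    """Correlation between initial individual error and revision size.

    Assumption: both quantities are taken as unadjusted absolute values.
    """
    initial_error = [abs(x - truth) for x in first_round]
    revision = [abs(b - a) for a, b in zip(first_round, last_round)]
    return pearson(initial_error, revision)


# Toy data (hypothetical): four agents, true answer 50.
truth = 50.0
first_round = [10.0, 40.0, 55.0, 90.0]
last_round = [30.0, 45.0, 53.0, 70.0]

r = revision_coefficient(first_round, last_round, truth)
delta_error = group_error(last_round, truth) - group_error(first_round, truth)
print(f"revision coefficient: {r:.3f}")
print(f"change in group error: {delta_error:.3f}")
```

In this toy case the agents who start farthest from the truth revise the most, so the coefficient is positive and the group error shrinks, which matches the pattern the captions describe for questions with $\Delta \varepsilon_q < 0$.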