Benchmarking Overton Pluralism in LLMs

Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, Michiel A. Bakker

TL;DR

The paper defines OvertonScore to quantify how well LLM outputs represent the diverse viewpoints within the Overton window, and validates it with a large-scale human study (N=1209, 60 questions, 8 LLMs) plus an automated benchmark that correlates highly with human judgments (ρ=0.88). Current models achieve roughly 0.35–0.41 unweighted and 0.48 weighted coverage, far from the ideal of 1.0, leaving substantial room for improvement in pluralistic alignment. The automated LLM-judge benchmark (Gemini 2.5 Pro with the Few-Shot + Free Response method, FS+FR) is a scalable proxy for the human study, assessed by its MAE and rank correlation against human ratings, and is intended to speed up model development without replacing human assessment. Together, these contributions offer a principled, scalable path toward LLMs that more faithfully represent the spectrum of legitimate public viewpoints.

Abstract

We introduce a novel framework for measuring Overton pluralism in LLMs--the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OvertonScore), (ii) conduct a large-scale U.S.-representative human study (N = 1209; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OvertonScores of 0.35--0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments ($\rho = 0.88$), providing a practical proxy without replacing human assessment. By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
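
To make the set coverage idea concrete, the following is a minimal sketch of how an unweighted and a prevalence-weighted coverage score could be computed for one question. It is not the paper's implementation: the function names, the cluster labels, the prevalence numbers, and the assumption that each viewpoint cluster has already been marked as represented or not are all hypothetical, and the paper's exact formalization (including any adjustment it applies) is not reproduced here.

```python
from typing import Dict


def overton_score(represented: Dict[str, bool]) -> float:
    """Unweighted coverage: fraction of viewpoint clusters marked as represented."""
    if not represented:
        raise ValueError("at least one viewpoint cluster is required")
    return sum(represented.values()) / len(represented)


def weighted_overton_score(
    represented: Dict[str, bool], prevalence: Dict[str, float]
) -> float:
    """Weighted coverage: share of total prevalence held by represented clusters."""
    total = sum(prevalence[name] for name in represented)
    if total <= 0:
        raise ValueError("prevalence weights must sum to a positive value")
    covered = sum(prevalence[name] for name, ok in represented.items() if ok)
    return covered / total


# Hypothetical clusters and prevalences for a single question (illustrative only).
prevalence = {"pro_regulation": 0.60, "balance_economy": 0.25, "other": 0.15}

# A response that represents only the majority cluster.
majority_only = {"pro_regulation": True, "balance_economy": False, "other": False}

# A response that represents every cluster.
pluralistic = {name: True for name in prevalence}

for label, flags in [("majority_only", majority_only), ("pluralistic", pluralistic)]:
    print(label, overton_score(flags), weighted_overton_score(flags, prevalence))
```

With these made-up numbers, the majority-only response covers one of three clusters (unweighted one third) but 60% of the weighted population, while the response that covers every cluster scores 1.0 on both variants.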

Paper Structure

This paper contains 47 sections, 9 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Overview of our benchmark for quantifying Overton pluralism. We cluster survey participants into distinct viewpoints on subjective questions and measure whether each group feels represented in a model’s response. The OvertonScore is the fraction of viewpoints adequately represented (✓); its weighted variant additionally accounts for each group's prevalence. Shown here for a carbon-emissions question: GPT o4-mini represents only the majority pro-regulation view, Llama 4 Maverick represents the minority “balance economy” view, while a hypothetical pluralistic model covers all viewpoints (score = 1.0). Model responses are real excerpts, abbreviated for clarity.
  • Figure 2: Benchmark results comparing the adjusted OvertonScore and the weighted OvertonScore$_W$, with 95% question-level bootstrap CIs (CIs are comparable only within each metric variant). A weighted score higher than the unweighted score indicates that the covered viewpoints represent a large number of people.
  • Figure 3: Mean absolute error (MAE) of the best-performing LLM prediction method (green): Gemini 2.5 Pro with the Few-Shot + Free Response text (FS+FR). Blue bars show baseline performance. 95% confidence intervals are calculated via nonparametric bootstrap.
  • Figure 4: Pairwise win-rate heatmap (OvertonScore). Values close to 1 indicate that the row model consistently outperforms the column model across $\tau$; values near 0 imply the reverse. Values near 0.5 indicate variable orderings.
  • Figure 6: Average accuracy, MAE, and MSE among baselines and the Gemini Pro LLM judge across prompting methods in the full study. The Few-Shot method generally outperforms all other methods across metrics, except Semantic Similarity. Higher accuracy and lower MAE/MSE are better. The error bars are 95% confidence intervals estimated via bootstrapping.
  • ...and 8 more figures
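
The captions above report MAE against human ratings and 95% bootstrap confidence intervals resampled at the question level. The sketch below shows one common way to compute both, using only the Python standard library. It assumes a percentile bootstrap over per-question scores; the captions do not state which bootstrap variant the authors used, and the function names and sample values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mean_absolute_error(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Average absolute gap between judge predictions and human ratings."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - h) for p, h in zip(predicted, human)) / len(predicted)


def bootstrap_ci(
    per_question: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean, resampling questions with replacement."""
    if not per_question:
        raise ValueError("need at least one per-question score")
    rng = random.Random(seed)
    n = len(per_question)
    means: List[float] = sorted(
        sum(rng.choices(per_question, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-question scores for one model (illustrative values only).
scores = [0.25, 0.50, 0.33, 0.40, 0.67, 0.20, 0.50, 0.33]
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Resampling whole questions, rather than individual ratings, keeps all responses to the same question together, which matches the "question-level" wording in the Figure 2 caption.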