Should you use LLMs to simulate opinions? Quality checks for early-stage deliberation
Terrence Neumann, Maria De-Arteaga, Sina Fazelpour
TL;DR
This work evaluates whether large language models (LLMs) can reliably simulate human opinions for early-stage research. It introduces two low-cost quality checks, Logical Consistency and Alignment with Stakeholder Expectations, tailored to Likert-scale tasks and domain knowledge, so that simulations can be assessed without large human benchmarks. A domain-specific testbed, TopicMisinfo, targets content moderation and gender-based prioritization, and includes Gold checks and a small human annotation set. Benchmarking seven LLMs with backstory prompting, fine-tuning, and in-context learning shows that none passes the full assessment; the failures highlight the risks of both overestimating and obscuring demographic differences. The authors release TopicMisinfo and provide risk-management guidance, arguing that these quality checks are essential for deciding whether to invest in more costly human data collection or to pursue alternative approaches.
Abstract
The emergent capabilities of large language models (LLMs) have prompted interest in using them as surrogates for human subjects in opinion surveys. However, prior evaluations of LLM-based opinion simulation have relied heavily on costly, domain-specific survey data, and mixed empirical results leave their reliability in question. To enable cost-effective, early-stage evaluation, we introduce a quality control assessment designed to test the viability of LLM-simulated opinions on Likert-scale tasks without requiring large-scale human data for validation. This assessment comprises two key tests, logical consistency and alignment with stakeholder expectations, offering a low-cost, domain-adaptable validation tool. We apply our quality control assessment to an opinion simulation task relevant to AI-assisted content moderation and fact-checking workflows, a socially impactful use case, and evaluate seven LLMs using a baseline prompt engineering method (backstory prompting), as well as fine-tuning and in-context learning variants. None of the models or methods passes the full assessment, revealing several failure modes. We conclude with a discussion of the risk management implications and release TopicMisinfo, a benchmark dataset that pairs human annotations with LLM annotations simulated by various models and approaches, to support future research.
