A Benchmark for Zero-Shot Belief Inference in Large Language Models

Joseph Malone, Rachith Aiyappa, Byunghwee Lee, Haewoon Kwak, Jisun An, Yong-Yeol Ahn

TL;DR

This work introduces a reproducible, zero-shot benchmark that evaluates how well large language models infer human beliefs across a broad range of domains, using Debate.org data. The study varies the background information available to a model through an information-ablation framework and tests nine open-source LLMs with a DSPy-guided prompting scheme. Additional context, especially demographic data combined with known beliefs, consistently improves predictive accuracy, though performance remains modest and domain-dependent. Key findings include three groups of belief categories, each with a different optimal input signal; the superior performance of the beliefs+demographics condition for most models; and a best-case macro-F1 of 63.5 for Phi-4 under the combined information setting. The results highlight both the capacity and the limits of current LLMs to emulate human belief reasoning, offer a scalable framework for modeling belief systems beyond sociopolitical contexts, and point to ethical considerations and future directions in prompting strategies and cross-domain analyses.

Abstract

Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.

Paper Structure

This paper contains 22 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Cumulative distributions of beliefs per user in the processed DDO dataset: (a) context (training) beliefs and (b) test beliefs. Users with fewer than five total belief statements were excluded.
  • Figure 2: Mean macro F1 scores across models for each prompt setting and belief category, separated into three groups based on each category's top-performing experiment setting.
  • Figure 3: Macro F1 scores when models participate in a majority vote, for each prompt setting and belief category, separated into three groups based on each category's top-performing experiment setting.
  • Figure 4: Macro F1 scores for each model when provided varying amounts of context beliefs under the beliefs and demographics experiment setting.
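
The figures above report macro F1, both per model and for a majority vote across models. As a rough illustration of those two quantities, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names, the binary "pro"/"con" label set, the tie-breaking rule, and the toy predictions are all assumptions made for the example. Scores are returned on a 0 to 1 scale, whereas the page quotes macro F1 on a 0 to 100 scale (e.g. 63.5).

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=("pro", "con")):
    """Unweighted mean of per-class F1 scores (labels are hypothetical)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_vote(predictions_by_model):
    """Combine predictions item by item; a tie goes to the label seen first."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_by_model)]


# Toy data: four test beliefs, three hypothetical models.
y_true = ["pro", "con", "pro", "con"]
model_a = ["pro", "con", "con", "con"]
model_b = ["pro", "pro", "pro", "con"]
model_c = ["con", "con", "pro", "con"]

voted = majority_vote([model_a, model_b, model_c])
print("model_a macro F1:", round(macro_f1(y_true, model_a), 3))
print("majority-vote macro F1:", round(macro_f1(y_true, voted), 3))
```

Because macro F1 averages per-class scores without weighting by class frequency, a model that always predicts the more common stance is penalized on the rarer one, which is presumably why it suits stance prediction over imbalanced belief categories.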