Knowing Your Uncertainty -- On the application of LLM in social sciences

Bolun Zhang, Linzhuo Li, Yunqi Chen, Qinlin Zhao, Zihan Zhu, Xiaoyuan Yi, Xing Xie

TL;DR

This paper addresses the critical challenge of uncertainty when applying LLMs to social science research. It introduces a two-dimensional task-validation framework (T,V) to tailor uncertainty quantification to specific tasks and data availability, integrating epistemic and aleatoric sources and treating prompts as a latent space. Through multiple illustrative studies (sentiment analysis, topic labeling, exploratory coding, historical counterfactuals), it demonstrates the utility and limits of various UQ methods and emphasizes that metric choice should be driven by task and validation context. The authors advocate an uncertainty-first workflow, open-source tooling, and careful research design to prevent overclaiming and to ensure rigorous, replicable social-science insights.

Abstract

Large language models (LLMs) are rapidly being integrated into computational social science research, yet their black-boxed training and deliberately stochastic inference pose unique challenges for scientific inquiry. This article argues that applying LLMs to social scientific tasks requires explicit assessment of uncertainty, an expectation long established both in quantitative social science methodology and in machine learning. We introduce a unified framework for evaluating LLM uncertainty along two dimensions: the task type (T), which distinguishes between classification, short-form generation, and long-form generation, and the validation type (V), which captures the availability of reference data or evaluative criteria. Drawing from both computer science and social science literature, we map existing uncertainty quantification (UQ) methods to this T-V typology and offer practical recommendations for researchers. Our framework provides both a methodological safeguard and a practical guide for integrating LLMs into rigorous social science research.

Paper Structure

This paper contains 16 sections, 11 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: High Level Conceptualization of LLM's Inference.
  • Figure 2: Comparison of 6 different UQ metrics in topic model labeling and counterfactual historical questions. Topic model labels are generated by GPT-OSS 20b and counterfactual historical questions are generated by Qwen3-32b.
  • Figure 3: Token-level entropy of three models on the sentiment analysis task.
  • Figure 4: Multi-class Brier scores of the three models in the sentiment analysis task.
  • Figure 5: Brier scores of the categories "Positive", "Neutral", and "Negative" across three models in the sentiment analysis task.
  • ...and 5 more figures
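
The captions for Figures 3 to 5 name two quantities, entropy and the multi-class Brier score, that apply to a classification task with reference labels. The following is a minimal sketch of how such quantities are commonly computed, not the authors' implementation. It assumes each prediction is available as a probability vector over the classes "Positive", "Neutral", and "Negative"; the function names and the example probabilities are hypothetical. The Brier score here uses the sum-over-classes convention (range 0 to 2), and the paper may normalize differently. The entropy is taken over the class distribution, which may differ from the token-level entropy in Figure 3.

```python
import math

# Class order is an assumption made for this illustration.
CLASSES = ["Positive", "Neutral", "Negative"]


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def multiclass_brier(probs, true_index):
    """Squared error between the predicted distribution and the one-hot label.

    Uses the sum-over-classes convention, so the score lies in [0, 2].
    """
    return sum(
        (p - (1.0 if k == true_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


def mean_brier(prob_rows, true_indices):
    """Average multi-class Brier score over a set of labeled examples."""
    scores = [multiclass_brier(p, y) for p, y in zip(prob_rows, true_indices)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical predicted distributions for two documents.
    predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
    labels = [CLASSES.index("Positive"), CLASSES.index("Negative")]
    for probs in predictions:
        print("entropy:", shannon_entropy(probs))
    print("mean Brier:", mean_brier(predictions, labels))
```

A lower Brier score indicates predicted probabilities closer to the reference labels, while entropy needs no labels at all, which is why the two suit different validation settings.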