Predicting Field Experiments with Large Language Models

Yaoyu Chen, Yuheng Hu, Yingda Lu

TL;DR

The paper addresses the feasibility of predicting real-world field-experiment outcomes using large language models. It introduces an automated three-stage framework (extraction, variant generation, prediction) and demonstrates 78% average accuracy across 276 economics field experiments (1261 conclusions), using non-fine-tuned LLMs with Chain-of-Thought prompts. Key findings include the beneficial effect of CoT prompting, model-iteration gains, and characteristic bimodal/skewed prediction distributions; it also reveals boundary conditions tied to topics like ethnicity and ethical dilemmas that limit predictability. The work offers a scalable, automated alternative for pilot-testing field experiments and clarifies practical constraints for deploying LLM-based experimental simulations in social science research.

Abstract

Large language models (LLMs) have demonstrated unprecedented emergent capabilities, including content generation, translation, and simulation of human behavior. Field experiments, on the other hand, are widely employed in social studies to examine real-world human behavior through carefully designed manipulations and treatments. However, field experiments are known to be expensive and time-consuming. Therefore, an interesting question is whether and how LLMs can be utilized for field experiments. In this paper, we propose and evaluate an automated LLM-based framework to predict the outcomes of a field experiment. Applying this framework to 276 experiments about a wide range of human behaviors drawn from renowned economics literature yields a prediction accuracy of 78%. Moreover, we find that the distributions of the results are either bimodal or highly skewed. By investigating this abnormality further, we find that field experiments related to complex social issues such as ethnicity, social norms, and ethical dilemmas can pose significant challenges to prediction performance.
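The summary reports accuracy both per conclusion and per paper. As a rough illustration of how those two levels can differ, here is a minimal Python sketch. It assumes each prediction is stored as a (paper_id, predicted, actual) tuple and that a prediction counts as correct on an exact label match. The record layout, the label strings, the function names, and the toy data are all hypothetical, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_paper(records):
    """Return {paper_id: fraction of that paper's conclusions predicted correctly}.

    `records` is an iterable of (paper_id, predicted, actual) tuples.
    This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for paper_id, predicted, actual in records:
        totals[paper_id] += 1
        hits[paper_id] += int(predicted == actual)
    return {paper_id: hits[paper_id] / totals[paper_id] for paper_id in totals}


def summarize(records):
    """Compute pooled (conclusion-level) and paper-averaged accuracy."""
    records = list(records)
    per_paper = accuracy_by_paper(records)
    # Pooled accuracy weights every conclusion equally.
    pooled = sum(int(pred == act) for _, pred, act in records) / len(records)
    # Paper-averaged accuracy weights every paper equally.
    paper_average = sum(per_paper.values()) / len(per_paper)
    return {"per_paper": per_paper, "pooled": pooled, "paper_average": paper_average}


# Hypothetical toy data, not taken from the paper.
toy = [
    ("paper_A", "positive", "positive"),
    ("paper_A", "no_effect", "positive"),
    ("paper_B", "negative", "negative"),
]
result = summarize(toy)
print(result)
```

On the toy data the two aggregates disagree (two of three conclusions are correct, while the per-paper accuracies are 0.5 and 1.0), which is why it matters which level a headline accuracy refers to.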

Paper Structure

This paper contains 20 sections, 3 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: The Data Collection Workflow.
  • Figure 2: Prediction Framework.
  • Figure 3: Paper Accuracy by Year.
  • Figure 4: Conclusion Prediction Accuracy Distribution.
  • Figure 5: Paper Prediction Accuracy Distribution.
  • ...and 9 more figures