Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

Yuan Gao, Dokyun Lee, Gordon Burtch, Sina Fazelpour

Abstract

Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Nearly all advanced approaches fail to replicate human behavior distributions across many models. Causes of failure are diverse and unpredictable, relating to input language, roles, and safeguarding. These results advise caution when using LLMs to study human behavior or as surrogates or simulations.

Paper Structure

This paper contains 16 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: 11-20 Money Request Game. The bar chart on the right shows the similarity between the distribution of different subjects and human subjects, measured by Jensen-Shannon divergence scores. Density plots are omitted for subjects with over 98% of the data concentrated in a single choice to avoid potential misinterpretation.
  • Figure 2: Prompt Brittleness: Roles and Languages. The bar chart on the right shows the similarity between the distribution of different subjects and human subjects, measured by Jensen-Shannon divergence scores. Density plots are omitted for subjects with over 98% of the data concentrated in a single choice to avoid potential misinterpretation.
  • Figure 3: Few-shot CoT. The shaded gray area represents the sample range we provided. The bar chart on the right shows the similarity between the distribution of different subjects and human subjects, measured by Jensen-Shannon divergence scores. Density plots are omitted for subjects with over 98% of the data concentrated in a single choice to avoid potential misinterpretation.
  • Figure 4: RAG and Fine-tuning. The bar chart on the right shows the similarity between the distribution of different subjects and human subjects, measured by Jensen-Shannon divergence scores. Density plots are omitted for subjects with over 98% of the data concentrated in a single choice to avoid potential misinterpretation.
  • Figure 5: LLMs' Memorization of the Game Instructions. We probe LLMs' familiarity with the instructions of the 11-20 Money Request Game and the 2/3 Guessing Game. The prompt is in plain English: “Tell me the instructions for the 11-20 Money Request Game (2/3 Guessing Game).”
  • ...and 5 more figures
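
The figure captions above compare each subject's choice distribution to the human distribution using Jensen-Shannon divergence. As a rough illustration of that measure, here is a minimal sketch in plain Python. It assumes choices are integers from 11 to 20 and uses base-2 logarithms, so the score lies between 0 (identical distributions) and 1 (no overlap). The paper's exact computation (log base, smoothing, or whether the square-root "distance" form is used) is not stated in this section, and all function names and sample values below are hypothetical.

```python
import math
from collections import Counter

# Assumption: the admissible requests are the integers 11 through 20.
CHOICES = range(11, 21)


def to_distribution(samples):
    """Turn a list of integer requests into a probability vector over CHOICES."""
    counts = Counter(samples)
    total = sum(counts[c] for c in CHOICES)
    if total == 0:
        raise ValueError("no samples fall inside the 11-20 range")
    return [counts[c] / total for c in CHOICES]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def is_concentrated(p, threshold=0.98):
    """True when more than `threshold` of the mass sits on a single choice,
    mirroring the captions' rule for omitting a density plot."""
    return max(p) > threshold


if __name__ == "__main__":
    # Made-up samples for illustration only; these are not data from the paper.
    reference = to_distribution([17] * 30 + [18] * 30 + [19] * 20 + [20] * 20)
    candidate = to_distribution([19] * 99 + [18])
    print("JS divergence:", js_divergence(reference, candidate))
    print("omit density plot:", is_concentrated(candidate))
```

A lower score means the subject's distribution sits closer to the reference. Because the mixture `m` is positive wherever either input is, the sketch needs no smoothing even when a subject never picks some values.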