Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games

Eilam Shapira, Omer Madmon, Roi Reichart, Moshe Tennenholtz

TL;DR

The paper demonstrates that large language models can generate synthetic data that train predictors of human choices in language-based persuasion games, sometimes outperforming predictors trained on actual human data when sample sizes are large enough. It further shows that fine-tuning LLMs on human data can yield direct predictors and high-quality data generators, with a calibration–accuracy trade-off that can be mitigated by the Double Use of human data for Augmented Learning (DUAL). Crucially, the study highlights history as a central driver of human decision-making in repeated interactions, showing that history-based patterns enable more accurate predictions than sentiment cues alone. The findings suggest a scalable, data-efficient path for modeling human decision-making in linguistically rich economic settings, with broad implications for synthetic data generation, model calibration, and ethical considerations in AI-driven behavioral research.

Abstract

Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLM-generated data can effectively predict human behavior in these games and even outperform models trained on actual human data. Beyond data generation, we investigate the dual role of LLMs as both data generators and predictors, introducing a comprehensive empirical study on the effectiveness of utilizing LLMs for data generation, human choice prediction, or both. We then utilize our choice prediction framework to analyze how strategic factors shape decision-making, showing that interaction history (rather than linguistic sentiment alone) plays a key role in predicting human decision-making in repeated interactions. Particularly, when LLMs capture history-dependent decision patterns similarly to humans, their predictive success improves substantially. Finally, we demonstrate the robustness of our findings across alternative persuasion-game settings, highlighting the broader potential of using LLM-generated data to model human decision-making.

Paper Structure (44 sections, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Results for the prediction task introduced in Section \ref{sec:task def}, comparing alternative ways to use data from the 110 human players and from the LLM-generated players. Left: a simple ML predictor (see Section \ref{sec:main res}) trained under three regimes: human-only data, LLM-generated data, and a mixed dataset. Middle: using the LLM itself as the predictor (see Section \ref{sec:tuning}): an off-the-shelf LLM versus the same LLM fine-tuned on the human data. Right: fine-tuning an LLM and then using it to generate training data for a simple predictor (see Section \ref{sec:tuning}). In the Double Use of human data for Augmented Learning (DUAL) variant, the human data used for fine-tuning is reused to train the simple ML predictor. While DUAL does not improve accuracy over Fine-Tuning, it nearly halves the expected calibration error.
  • Figure 2: Left: Illustration of a single round in the language-based persuasion game. First, the expert observes the interaction history of previous rounds (not shown in the illustration) and the current hotel's review-score pairs. She selects a single review from this set according to her predefined strategy and sends it to the decision-maker (DM). The DM then observes this review, along with the entire interaction history, and chooses an action. Finally, both agents receive payoffs based on the DM's action and the hotel's true quality. Right: An example of an expert strategy.
  • Figure 3: A sample review from the hotel reviews dataset.
  • Figure 4: Number of LLM-generated players (using Chat-Bison) required to achieve different levels of accuracy with the LSTM predictor, with persona diversification (various personas) and without it ('default' persona only). The higher the desired accuracy, the larger the gap between the sample sizes the two methods require.
  • Figure 5: Accuracy obtained by prediction models trained on different data sources. Grey lines show the accuracy of models trained on human data from varying numbers of players. Results are shown for LSTM, transformer, Mamba, and XGBoost. Notably, for every prediction model, training on LLM-generated data outperforms training on actual human choice data once the number of LLM players is large enough. In addition, the LLM-based training paradigm outperforms the sentiment analysis baseline, implying that letting simulated players determine behavior (and not just interpret the textual signal) yields a better predictor.
  • ...and 12 more figures
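
The Figure 1 caption compares methods by expected calibration error (ECE) as well as accuracy. As a rough illustration of what that metric measures, the sketch below computes a common equal-width-bin ECE for binary predictions. The function name, the choice of 10 bins, and the binning of the predicted positive-class probability are assumptions made here for illustration; the paper's exact ECE definition may differ.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE for binary predictions (illustrative sketch).

    probs:  predicted probability of the positive class, each in [0, 1]
    labels: observed outcomes, each 0 or 1
    n_bins: number of equal-width probability bins (assumed default: 10)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")

    # Group (probability, label) pairs by probability bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by its share of the samples.
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece
```

A predictor that outputs 1.0 for four choices of which only two are positive is maximally overconfident in that bin, so `expected_calibration_error([1.0] * 4, [1, 1, 0, 0])` returns 0.5. A predictor whose stated probabilities match the observed frequencies scores close to 0.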