Framing the Game: How Context Shapes LLM Decision-Making

Isaac Robinson, John Burden

TL;DR

The paper investigates how contextual framing shapes LLM decision-making in the Prisoner’s Dilemma and introduces a dynamic, procedurally generated evaluation framework that systematically varies vignettes across topic, world type, and actor relations while preserving the same underlying $2\times2$ payoff structure. It demonstrates substantial context-driven variance in cooperation decisions across GPT-4o, Claude, and Llama, with results showing that framing effects are largely predictable yet model- and context-dependent, and that order biases further modulate responses. By extending the Factorial Survey approach with dynamic vignette generation and a story generator, the work provides a scalable, contamination-resistant methodology for robust LLM evaluation in real-world decision contexts. The findings underscore the limitations of static benchmarks for assessing decision-making in open-domain models and offer practical guidance for dynamic, context-aware evaluation strategies, including open-source tooling to enable broader adoption. Together, these contributions advance understanding of how framing and narrative context influence LLM behavior and inform more reliable deployment practices in decision-critical settings.

Abstract

Large Language Models (LLMs) are increasingly deployed across diverse contexts to support decision-making. While existing evaluations effectively probe latent model capabilities, they often overlook the impact of context framing on perceived rational decision-making. In this study, we introduce a novel evaluation framework that systematically varies evaluation instances across key features and procedurally generates vignettes to create highly varied scenarios. By analyzing decision-making patterns across different contexts with the same underlying game structure, we uncover significant contextual variability in LLM responses. Our findings demonstrate that this variability is largely predictable yet highly sensitive to framing effects. Our results underscore the need for dynamic, context-aware evaluation methodologies for real-world deployments.
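As a rough illustration of the design described above, the sketch below fixes one Prisoner's Dilemma payoff matrix and enumerates a full factorial grid of framing conditions around it. The payoff values, the factor levels, and the function names are assumptions made for illustration only; they are not taken from the paper or its tooling.

```python
from itertools import product

# Hypothetical payoffs as (row player, column player); illustrative values only.
# "C" = Cooperate, "D" = Defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Hypothetical levels for the three framing factors named in the TL;DR.
FACTORS = {
    "topic": ["business", "environment"],
    "world_type": ["realistic", "fantasy"],
    "actor_relation": ["strangers", "allies"],
}


def defect_strictly_dominates(payoffs):
    """Return True if Defect pays the row player strictly more than
    Cooperate against every possible action of the other player."""
    return all(
        payoffs[("D", other)][0] > payoffs[("C", other)][0]
        for other in ("C", "D")
    )


def factorial_conditions(factors):
    """Enumerate every combination of factor levels (a full factorial design)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


# Every condition shares the same payoff matrix; only the framing varies.
conditions = factorial_conditions(FACTORS)
```

Each entry in `conditions` would then be handed to a vignette generator that writes a story around the shared payoffs, so that any difference in a model's choices can be attributed to framing rather than to the game itself.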

Paper Structure

This paper contains 22 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Payoff matrix exhibiting strict dominance of Defect.
  • Figure 2: An example vignette produced by the story generator.
  • Figure 3: Overview of the generative evaluation process.
  • Figure 4: Distribution of decisions made by the different models.
  • Figure 5: Agreement by topic and actor type across Llama, Claude and GPT-4o. Error bars represent a 95% CI.
  • ...and 5 more figures