Framing the Game: How Context Shapes LLM Decision-Making
Isaac Robinson, John Burden
TL;DR
The paper investigates how contextual framing shapes LLM decision-making in the Prisoner’s Dilemma. It introduces a dynamic, procedurally generated evaluation framework that systematically varies vignettes across topic, world type, and actor relations while preserving the same underlying $2\times2$ payoff structure. Across GPT-4o, Claude, and Llama, cooperation decisions vary substantially with context: framing effects are largely predictable yet model- and context-dependent, and order biases further modulate responses. By extending the Factorial Survey approach with dynamic vignette generation and a story generator, the work provides a scalable, contamination-resistant methodology for evaluating LLMs in real-world decision contexts. The findings underscore the limitations of static benchmarks for assessing decision-making in open-domain models and offer practical guidance for dynamic, context-aware evaluation, including open-source tooling to enable broader adoption. Together, these contributions clarify how framing and narrative context influence LLM behavior and inform more reliable deployment in decision-critical settings.
Abstract
Large Language Models (LLMs) are increasingly deployed across diverse contexts to support decision-making. While existing evaluations effectively probe latent model capabilities, they often overlook the impact of contextual framing on perceived rational decision-making. In this study, we introduce a novel evaluation framework that systematically varies evaluation instances across key features and procedurally generates vignettes to create highly varied scenarios. By analyzing decision-making patterns across different contexts that share the same underlying game structure, we uncover significant contextual variability in LLM responses. We find that this variability is largely predictable yet highly sensitive to framing effects. These results underscore the need for dynamic, context-aware evaluation methodologies for real-world deployments.
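
To make the factorial design concrete, the following is a minimal sketch of how vignette conditions might be enumerated while the game itself stays fixed. It is an illustration under stated assumptions, not the authors' implementation: the factor levels, the payoff values, and the names `FACTORS`, `PAYOFFS`, `is_prisoners_dilemma`, and `generate_vignettes` are all hypothetical. Only the three dimensions (topic, world type, actor relations) and the fixed $2\times2$ Prisoner's Dilemma structure come from the text above.

```python
import itertools
import random

# Hypothetical factor levels; the paper's actual levels are not given here.
FACTORS = {
    "topic": ["trade", "environment", "security"],
    "world_type": ["realistic", "fantasy"],
    "actor_relation": ["allies", "rivals", "strangers"],
}

# Illustrative payoffs (row player, column player) with the usual
# Prisoner's Dilemma ordering T > R > P > S. Values are an assumption.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # R, R
    ("cooperate", "defect"): (0, 5),     # S, T
    ("defect", "cooperate"): (5, 0),     # T, S
    ("defect", "defect"): (1, 1),        # P, P
}


def is_prisoners_dilemma(payoffs):
    """Check the row player's payoffs satisfy T > R > P > S and 2R > T + S."""
    r = payoffs[("cooperate", "cooperate")][0]
    s = payoffs[("cooperate", "defect")][0]
    t = payoffs[("defect", "cooperate")][0]
    p = payoffs[("defect", "defect")][0]
    return t > r > p > s and 2 * r > t + s


def generate_vignettes(factors, payoffs, seed=0):
    """Enumerate every combination of factor levels (a full factorial design).

    Each condition reuses the same payoff matrix, so only the framing varies.
    The option order is shuffled per condition so that order effects can be
    separated from framing effects in later analysis.
    """
    rng = random.Random(seed)
    names = list(factors)
    vignettes = []
    for levels in itertools.product(*(factors[name] for name in names)):
        options = ["cooperate", "defect"]
        rng.shuffle(options)
        condition = dict(zip(names, levels))
        condition["option_order"] = tuple(options)
        condition["payoffs"] = payoffs
        vignettes.append(condition)
    return vignettes


vignettes = generate_vignettes(FACTORS, PAYOFFS)
```

Each resulting condition would then be handed to a story generator that renders it as narrative text. Because every condition shares one payoff matrix, differences in a model's cooperation rate across conditions can be attributed to framing rather than to the incentives.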
