Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames
Vince Trencsenyi, Agnieszka Mensfelt, Kostas Stathis
TL;DR
This work tackles the evaluation of recursive strategic reasoning in LLM-driven multi-agent systems by introducing a centralized umpire framework and hypergame-based reasoning to enable deep, multi-level belief modeling in two-player beauty contest tasks. The authors propose a κ-based measure to complement traditional k-level reasoning and demonstrate how LLM-enhanced agents can outperform a cognitive hierarchy baseline and approximate human data in structured game settings. Through two experiments with profiling, they show that professional-domain prompts and agent profiles modulate reasoning depth, with some models approaching human performance yet not consistently achieving the optimal solution. The key contributions are a flexible, agency-rich MAS platform, the κ metric for reasoning depth, and empirical evidence that artificial reasoners can closely track human behavior and sometimes exceed baseline performance, informing the design of future LLM-based strategic agents. This work advances systematic evaluation of LLM reasoning in strategic contexts and motivates semantic methods to assess the quality of inferred reasoning processes.
Abstract
LLM-driven multi-agent-based simulations have been gaining traction with applications in game-theoretic and social simulations. While most implementations seek to exploit or evaluate LLM-agentic reasoning, they often do so with a weak notion of agency and simplified architectures. We implement a role-based multi-agent strategic interaction framework tailored to sophisticated recursive reasoners, providing the means for systematic, in-depth development and evaluation of strategic reasoning. Our game environment is governed by an umpire responsible for facilitating games, from matchmaking through move validation to environment management. Players incorporate state-of-the-art LLMs in their decision mechanism, relying on a formal hypergame-based model of hierarchical beliefs. We use one-shot, two-player beauty contests to evaluate the recursive reasoning capabilities of the latest LLMs, providing a comparison to an established baseline model from economics and data from human experiments. Furthermore, we introduce the foundations of an alternative semantic measure of reasoning to the k-level theory. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.
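To make the k-level idea concrete, the following is a minimal sketch of the conventional level-k heuristic in a two-player beauty contest. It assumes the textbook setup: guesses in [0, 100], a target of p times the mean guess with p = 2/3, and a level-0 anchor of 50. These parameters and the function names `level_k_guess` and `winner` are illustrative assumptions, not the paper's configuration or implementation.

```python
def level_k_guess(k: int, p: float = 2 / 3, level0_guess: float = 50.0) -> float:
    """Guess of a level-k reasoner under the conventional heuristic.

    Level 0 plays the anchor; each higher level multiplies the guess of the
    level below by p. All default values are assumptions for illustration.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return level0_guess * p ** k


def winner(guess_a: float, guess_b: float, p: float = 2 / 3) -> str:
    """Return 'a', 'b', or 'tie' depending on whose guess is closer to p * mean."""
    target = p * (guess_a + guess_b) / 2
    dist_a = abs(guess_a - target)
    dist_b = abs(guess_b - target)
    if dist_a < dist_b:
        return "a"
    if dist_b < dist_a:
        return "b"
    return "tie"


if __name__ == "__main__":
    # Guesses shrink towards 0 as the assumed reasoning depth grows.
    for k in range(5):
        print(k, round(level_k_guess(k), 2))
    # A deeper reasoner (lower guess) beats a shallower one in the two-player game.
    print(winner(level_k_guess(2), level_k_guess(1)))
```

With two players and p < 1, the target always lies below the midpoint of the two guesses, so the lower guess wins and guessing 0 can never lose. This is why the two-player variant has a clear optimal solution against which reasoning depth can be compared.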
