Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk
TL;DR
The paper investigates the reasoning capabilities of contemporary LLMs within the General Game Playing (GGP) framework by evaluating forward simulation tasks in the Game Description Language (GDL). Using four diverse models, the study analyzes how problem structure, horizon length, and linguistic grounding affect symbolic reasoning, and it introduces obfuscation to isolate semantic cues from structural patterns. Key findings show that strong models achieve high one-step accuracy but struggle with longer-horizon reasoning, with degradation tied to rule depth and complexity; obfuscation generally reduces performance, though some models can maintain symbolic reasoning without meaningful variable names. The work highlights common failure modes such as hallucinated rules, extraneous facts, and formatting errors, and it advocates for structured verification of LLM outputs in critical reasoning tasks. Overall, the results demonstrate meaningful progress in formal symbolic reasoning by LLMs while outlining limitations and directions for robust, trustworthy deployment in structured domains.
Abstract
This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B, and GPT-OSS 120B) on a suite of forward-simulation tasks, including next-state and multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degrading as the evaluation horizon (i.e., the number of game steps) increases. Detailed case-based analysis of LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, and syntactic errors. Overall, the paper reports clear progress in the formal reasoning capabilities of contemporary models.
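To make the evaluation setting more concrete, the following is a minimal sketch of how a predicted game state might be checked against a ground-truth state, and how symbol names might be obfuscated consistently. It is an illustration only, not the authors' code: the tuple-based fact representation, the function names `compare_states` and `obfuscate`, and the `s0, s1, ...` renaming scheme are all assumptions, and the paper's actual metrics and obfuscation variants may differ.

```python
from itertools import count

# Assumption: a state is a collection of ground facts written as tuples,
# loosely mirroring GDL facts such as (cell 1 1 x) or (control oplayer).


def compare_states(predicted, expected):
    """Return (exact_match, missing_facts, extraneous_facts)."""
    predicted, expected = set(predicted), set(expected)
    missing = expected - predicted      # facts the model failed to derive
    extraneous = predicted - expected   # redundant or hallucinated facts
    return (not missing and not extraneous), missing, extraneous


def obfuscate(facts, keep=()):
    """Consistently rename every symbol not in `keep` to an opaque token.

    Hypothetical scheme: symbols become s0, s1, ... in order of first
    appearance, so structure is preserved while names lose their meaning.
    """
    mapping = {}
    fresh = count()

    def rename(symbol):
        if symbol in keep:
            return symbol
        if symbol not in mapping:
            mapping[symbol] = f"s{next(fresh)}"
        return mapping[symbol]

    renamed = [tuple(rename(s) for s in fact) for fact in facts]
    return renamed, mapping


expected = [("cell", "1", "1", "x"), ("control", "oplayer")]
predicted = expected + [("cell", "2", "2", "b")]  # one extraneous fact

ok, missing, extraneous = compare_states(predicted, expected)
obfuscated, mapping = obfuscate(expected, keep=("1",))
```

In this sketch a prediction counts as correct only on an exact set match, which makes extraneous facts as costly as missing ones; a softer score (for example, set overlap) would be an equally plausible design choice.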
