TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning
Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji
TL;DR
TextGames is a benchmark that stress-tests LLM reasoning on demanding text-based puzzles combining pattern recognition, spatial awareness, arithmetic, and logic. It comprises eight puzzles in 1D and 2D formats at three difficulty levels, together with a multi-turn self-reflection workflow, and uses them to systematically evaluate how prompting and feedback influence performance. LLMs solve most easy and medium tasks, but hard tasks remain challenging. Reasoning-focused models and multi-turn prompting improve results, although some inverse scaling is observed for longer reasoning. Humans consistently outperform LLMs on hard tasks, underscoring gaps in current reasoning capabilities and the value of self-reflective mechanisms. TextGames provides a reusable platform and dataset for advancing constrained text-based reasoning research.
Abstract
Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, as well as their ability to leverage feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle to sequence, count, and follow complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.
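To make the multi-turn setting concrete, the following is a minimal sketch of a feedback loop in which an answer is checked against puzzle constraints and any violations are returned to the solver for the next attempt. It is an illustration under stated assumptions, not the benchmark's implementation: the toy constraints, the names `verify`, `play`, and `make_scripted_solver`, and the default of three turns are all hypothetical, and the scripted solver merely stands in for an LLM call.

```python
from typing import Callable, List, Tuple

# A constraint pairs a human-readable rule with a predicate on the answer.
Constraint = Tuple[str, Callable[[str], bool]]
Solver = Callable[[List[str]], str]


def verify(answer: str, constraints: List[Constraint]) -> List[str]:
    """Return the description of every constraint the answer violates."""
    return [desc for desc, check in constraints if not check(answer)]


def play(solver: Solver, constraints: List[Constraint],
         max_turns: int = 3) -> Tuple[bool, int, str]:
    """Run one episode; return (solved, turns_used, last_answer).

    max_turns=1 corresponds to single-turn evaluation. The default of 3
    is an assumption made for this sketch.
    """
    feedback: List[str] = []
    answer = ""
    for turn in range(1, max_turns + 1):
        answer = solver(feedback)               # solver sees prior violations
        feedback = verify(answer, constraints)  # grader produces new feedback
        if not feedback:
            return True, turn, answer
    return False, max_turns, answer


# Hypothetical toy puzzle mixing counting, arithmetic, and pattern rules.
constraints: List[Constraint] = [
    ("length must be exactly 6", lambda s: len(s) == 6),
    ("digits must sum to 9",
     lambda s: sum(int(c) for c in s if c.isdigit()) == 9),
    ("must contain an uppercase letter",
     lambda s: any(c.isupper() for c in s)),
]


def make_scripted_solver() -> Solver:
    """Stand-in for an LLM: patches a fixed draft according to feedback."""
    state = {"draft": "abc45"}

    def solver(feedback: List[str]) -> str:
        draft = state["draft"]
        for msg in feedback:
            if "length" in msg:
                draft = draft.ljust(6, "x")          # pad a too-short draft
            if "uppercase" in msg:
                draft = draft[0].upper() + draft[1:]  # capitalize first char
        state["draft"] = draft
        return draft

    return solver


if __name__ == "__main__":
    print(play(make_scripted_solver(), constraints, max_turns=1))
    print(play(make_scripted_solver(), constraints, max_turns=3))
```

In this sketch the first draft fails two rules, so the single-turn episode is unsolved, while the multi-turn episode succeeds once the violations are fed back. Replacing the scripted solver with a model call that receives the violation messages in its prompt would give the self-reflection setting described above.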
