TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji

TL;DR

TextGames introduces a comprehensive benchmark to stress-test LLM reasoning on demanding text-based puzzles that combine pattern recognition, spatial awareness, arithmetic, and logic. By presenting eight puzzles across 1D and 2D formats with three difficulty levels and a multi-turn self-reflection workflow, the work systematically evaluates how prompting and feedback influence performance. Findings show that while LLMs can solve easy and medium tasks, hard tasks remain challenging, though reasoning-focused models and multi-turn prompting boost results, with some inverse-scaling observed for longer reasoning. Humans consistently outperform LLMs on hard tasks, underscoring gaps in current reasoning capabilities and the value of self-reflective mechanisms; TextGames provides a reusable platform and dataset for advancing constrained text-based reasoning research.

Abstract

Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.
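
The multi-turn self-reflection setting described above can be sketched as a loop in which a verifier checks each answer and its feedback is passed back to the model for the next attempt. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy puzzle, the `verify` and `solve_multi_turn` functions, the `scripted_model` stand-in for an LLM, and the default of three turns are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (previous answer, feedback on that answer) pairs shown to the model on later turns.
History = List[Tuple[str, str]]


def verify(answer: str, length: int, required: str) -> Tuple[bool, str]:
    """Check a hypothetical toy puzzle (not one of the eight TextGames puzzles):
    the answer must have exactly `length` characters and contain `required`."""
    if len(answer) != length:
        return False, f"Answer must have exactly {length} characters, got {len(answer)}."
    if required not in answer:
        return False, f"Answer must contain '{required}'."
    return True, "Correct."


def solve_multi_turn(
    model: Callable[[History], str],
    length: int,
    required: str,
    max_turns: int = 3,  # assumed turn budget, chosen only for illustration
) -> Optional[int]:
    """Return the turn on which the puzzle was solved, or None if never solved."""
    history: History = []
    for turn in range(1, max_turns + 1):
        answer = model(history)
        solved, feedback = verify(answer, length, required)
        if solved:
            return turn
        # Self-reflection step: the failed answer and its feedback are fed back in.
        history.append((answer, feedback))
    return None


def scripted_model(history: History) -> str:
    """Stand-in for an LLM call: replays fixed attempts, one per turn."""
    attempts = ["cat", "catalog", "concat"]
    return attempts[len(history)]


if __name__ == "__main__":
    print(solve_multi_turn(scripted_model, length=6, required="cat"))
```

In a real evaluation, `scripted_model` would be replaced by a call to an LLM whose prompt includes the puzzle statement and the accumulated history, and accuracy would be reported per turn budget.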

Paper Structure

This paper contains 40 sections, 8 figures, 27 tables, 1 algorithm.

Figures (8)

  • Figure 1: Single-turn performance on TextGames games across 1D and 2D Puzzles challenges with varying difficulty levels (top), alongside the improvement in accuracy achieved through increased turn attempts via self-reflection, with the x-axis representing the number of turns (bottom).
  • Figure 2: TextGames Benchmark consists of eight text-based puzzle games, each with unique constraints and gameplay mechanics. The top four games are 1D Puzzles, while the bottom four are 2D Puzzles.
  • Figure 3: LLM results on the TextGames Benchmark in the one-shot setting. Med indicates the Medium difficulty level. *For GPT-o3 Mini, we present the results from the zero-shot setting.
  • Figure 4: LLM performance on the Bracket Game in the one-shot setting, excluding GPT models. The results show that increasing the number of turns generally enhances performance. A similar trend is evident in Crossword Arranger, as shown in the multi-turn results figures in the Appendix, which include illustrations from all games.
  • Figure 5: In hard games, the test-time scaling of GPT-o3 Mini displays inverse scaling behavior, with longer reasoning traces often leading to incorrect results.
  • ...and 3 more figures