PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha

TL;DR

PuzzlePlex introduces a scalable benchmark to probe foundation models on long-horizon reasoning and planning through 15 diverse, rule-based puzzles spanning text and image modalities. It evaluates both interactive instruction-based reasoning and autonomous code-based execution, using a modular puzzle-generation framework, fine-grained metrics, and handcrafted game-playing strategies as baselines for comparison. Key findings show that reasoning-enabled models excel in instruction-based scenarios and scale favorably with deliberation, while code-based puzzle solving remains harder because it requires program synthesis, though it offers efficiency gains via reusable code. The benchmark also highlights the strong performance of open-source models, the value of legality-aware prompting, and the ongoing challenge of multi-hop reasoning, pointing to clear avenues for future improvements in generalization and cross-modal strategic reasoning.

Abstract

This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.

Paper Structure

This paper contains 50 sections, 24 figures, 17 tables.

Figures (24)

  • Figure 1: Overview of four puzzles: SudoKill (two-player deterministic), TidyTower (single-player deterministic), Beat or Bomb Sto (two-player stochastic), and Ruby Risks (single-player stochastic).
  • Figure 2: Overview of the developed pipeline framework. Puzzle Generator creates puzzle instances from templates based on the puzzle name, difficulty level, and selected competing models. The Solver then generates a response after receiving the puzzle instance. This response is passed to the Transition Checker, which verifies the legality of the operation output by the Solver and checks the game status. If the game ends, the Evaluator calculates and outputs the score. Otherwise, State Transition updates the state and passes the updated information back to the Solver.
  • Figure 3: Comparison between the reasoning model DeepSeek-R1 and the non-reasoning model DeepSeek-V3 in terms of generated token counts versus normalized scores on single-player deterministic puzzles.
  • Figure 5: Description of SudoKill.
  • Figure 6: Description of TidyTower.
  • ...and 19 more figures
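
The loop described in the Figure 2 caption (generate an instance, query the Solver, check legality and game status, then either score or update the state) can be sketched as follows. This is a minimal illustration, not the PuzzlePlex implementation: the toy countdown puzzle, the scoring rule, and every name here (`State`, `generate_puzzle`, `is_legal`, `apply_move`, `evaluate`, `run_episode`, `greedy_solver`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class State:
    """Toy puzzle state (hypothetical): bring `remaining` down to exactly 0."""
    remaining: int
    steps: int = 0


def generate_puzzle(difficulty: int) -> State:
    """Puzzle Generator: build an instance from a toy template."""
    return State(remaining=5 * difficulty)


def is_legal(state: State, move: int) -> bool:
    """Transition Checker (legality): a move subtracts 1 to 3 without overshooting."""
    return 1 <= move <= 3 and move <= state.remaining


def is_over(state: State) -> bool:
    """Transition Checker (game status): the toy puzzle ends at zero."""
    return state.remaining == 0


def apply_move(state: State, move: int) -> State:
    """State Transition: produce the updated state handed back to the Solver."""
    return State(remaining=state.remaining - move, steps=state.steps + 1)


def evaluate(state: State, legal: bool) -> float:
    """Evaluator: toy score, 1.0 only if the puzzle was solved with legal moves."""
    return 1.0 if legal and state.remaining == 0 else 0.0


def run_episode(solver: Callable[[State], int], difficulty: int,
                max_steps: int = 100) -> float:
    """Run one episode: Generator -> Solver -> Checker -> Transition or Evaluator."""
    state = generate_puzzle(difficulty)
    for _ in range(max_steps):
        move = solver(state)
        if not is_legal(state, move):
            # Assumption: an illegal move ends the episode with the lowest score.
            return evaluate(state, legal=False)
        state = apply_move(state, move)
        if is_over(state):
            return evaluate(state, legal=True)
    # Step budget exhausted without finishing the puzzle.
    return evaluate(state, legal=True)


def greedy_solver(state: State) -> int:
    """Stand-in for a model or handcrafted strategy: take the largest legal move."""
    return min(3, state.remaining)
```

Calling `run_episode(greedy_solver, difficulty=2)` plays one episode with the stand-in strategy. In the benchmark's instruction-based setting the `solver` callable would wrap a model queried at each step, whereas in the code-based setting it would be a program the model wrote once and that is then reused across steps.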