PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

Yitao Long; Yuru Jiang; Hongjun Liu; Yilun Zhao; Jingchen Sun; Yiqiu Shen; Chen Zhao; Arman Cohan; Dennis Shasha

TL;DR

PuzzlePlex is a scalable benchmark that probes foundation models on long-horizon reasoning and planning through 15 diverse, rule-based puzzles spanning text and image modalities. It evaluates two settings, interactive instruction-based reasoning and autonomous code-based execution, and provides a modular puzzle-generation framework, fine-grained metrics, and handcrafted game-playing strategies for comparison. Reasoning-enabled models perform best in the instruction-based setting and scale favorably with additional deliberation. Code-based puzzle solving remains harder because it requires program synthesis, though reusable code makes it more efficient. The benchmark also highlights the strong performance of open-source models, the value of legality-aware prompting, and the ongoing challenge of multi-hop reasoning, pointing to future work on generalization and cross-modal strategic reasoning.

Abstract

This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
