Sudoku-Bench: Evaluating creative reasoning with Sudoku variants
Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, Llion Jones
TL;DR
Sudoku-Bench targets a gap in current evaluations: large language models struggle to demonstrate genuine creative reasoning. It focuses on Sudoku variants whose solutions hinge on break-ins, novel multi-step deductions that resist memorization. The release comprises open-source tooling built around SudokuPad, a text-based puzzle representation, and a curated 100-puzzle benchmark spanning $4 × 4$, $6 × 6$, and $9 × 9$ grids, designed to probe long-horizon deduction and constraint interaction. The framework supports both multi-step and single-shot evaluation and includes expert reasoning traces to enable imitation learning. Baseline models solve fewer than 15% of puzzles without tool use, and performance is weakest on larger grids. Sudoku-Bench thus provides a controlled, extensible platform for creative reasoning research, supported by rich human transcripts and open-source integrations for broad adoption and further study.
Abstract
Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
