Sudoku-Bench: Evaluating creative reasoning with Sudoku variants

Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, Llion Jones

TL;DR

Sudoku-Bench targets a gap in current evaluation: large language models struggle to demonstrate genuine creative reasoning. It focuses on Sudoku variants whose novel, interacting constraints resist memorization and require a "break-in", a new logical breakthrough found through multi-step deduction. It introduces a curated 100-puzzle benchmark spanning $4 \times 4$, $6 \times 6$, and $9 \times 9$ grids, a text-based puzzle representation, and open-source tooling around SudokuPad, all designed to probe long-horizon deduction and constraint interaction. The framework supports both multi-step and single-shot evaluation and includes expert reasoning traces to enable imitation learning. Baseline models solve fewer than 15% of puzzles without tool use, with performance weakest on larger grids. Sudoku-Bench thus provides a controlled, extensible platform for creative reasoning research, supported by rich human transcripts and open-source integrations for broad adoption and further study.

Abstract

Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.

Paper Structure

This paper contains 20 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Each Sudoku variant has a unique set of constraints explicitly described in the puzzle rules. Puzzles may feature whimsical rules such as in Rat Run, or meta-level constraints, such as requiring all standard Sudoku rules to be intentionally violated.
  • Figure 2: Ascension example.
  • Figure 3: A text representation of a puzzle. The rules, initial grid, and a text description of visual elements are sufficient to unambiguously specify the puzzle.
  • Figure 4: Response categorization for the single-shot setting.
  • Figure 5: Sumthings example.
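
The caption of Figure 3 names three components of the text representation: the rules, the initial grid, and a text description of the visual elements. The following is a minimal Python sketch of how such a record might be held and rendered into plain text. The class name `PuzzleText`, its field names, the `.` placeholder for empty cells, and the toy puzzle are all illustrative assumptions, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PuzzleText:
    """Hypothetical container; names are illustrative, not the benchmark's schema."""
    rules: str
    rows: int
    cols: int
    initial_board: str  # row-major, one character per cell, '.' marks an empty cell
    visual_elements: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.initial_board) != self.rows * self.cols:
            raise ValueError("initial_board length must equal rows * cols")

    def grid_lines(self) -> List[str]:
        """Split the flat board string into one space-separated line per row."""
        return [
            " ".join(self.initial_board[r * self.cols:(r + 1) * self.cols])
            for r in range(self.rows)
        ]

    def render(self) -> str:
        """Join rules, grid, and visual-element descriptions into one text block."""
        parts = ["Rules:", self.rules, "", f"Initial grid ({self.rows}x{self.cols}):"]
        parts.extend(self.grid_lines())
        if self.visual_elements:
            parts.extend(["", "Visual elements:"])
            parts.extend(f"- {element}" for element in self.visual_elements)
        return "\n".join(parts)


def respects_givens(puzzle: PuzzleText, candidate: str) -> bool:
    """True if the candidate fills every cell and agrees with every given digit.

    This checks only the givens; variant-specific rules are not verified here.
    """
    if len(candidate) != len(puzzle.initial_board) or "." in candidate:
        return False
    return all(g == "." or g == c for g, c in zip(puzzle.initial_board, candidate))


# Made-up toy puzzle, used only to exercise the sketch.
toy = PuzzleText(
    rules="Normal 4x4 sudoku rules apply. Digits along the thermometer increase from the bulb.",
    rows=4,
    cols=4,
    initial_board="1" + "." * 15,
    visual_elements=["thermometer with bulb at r1c1, running to r1c3"],
)
print(toy.render())
```

Keeping the visual elements as free-text lines mirrors the caption's point that a text description is enough to specify the puzzle; a stricter schema would be a separate design choice.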