
Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Justin Waugh

TL;DR

This work introduces Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification.

Abstract

We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, which localizes errors to the exact rule violated and provides the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours, making the benchmark a demanding test of long-context utilization, not just reasoning.
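To make the idea of step-level verification concrete, the following is a minimal sketch of a checker for one norinori-style rule, in which every shaded cell must pair with exactly one orthogonally adjacent shaded cell. It is not the benchmark's verifier: the function name `check_shaded_adjacency`, the `'#'`/`'.'` board encoding, the `final` flag, and the "more than one" message are all assumptions made for illustration. Only the message "Shaded cell has no adjacent shaded cell" is taken from the Figure 2 caption.

```python
from typing import List, Tuple

# (row, col, rule message) for each offending cell; a hypothetical format.
Violation = Tuple[int, int, str]


def check_shaded_adjacency(board: List[str], final: bool = False) -> List[Violation]:
    """Check a norinori-style pairing rule on a grid of '#' (shaded) and '.' cells.

    Assumption: a shaded cell with no shaded neighbor is only an error once
    the board is declared complete (final=True), because a partial board may
    still receive the missing neighbor on a later move.
    """
    rows, cols = len(board), len(board[0])
    violations: List[Violation] = []
    for r in range(rows):
        for c in range(cols):
            if board[r][c] != "#":
                continue
            # Count orthogonally adjacent shaded cells that lie inside the grid.
            neighbors = sum(
                1
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows
                and 0 <= c + dc < cols
                and board[r + dr][c + dc] == "#"
            )
            if neighbors > 1:
                # Wrong at any point in the solve: no later move can repair it.
                violations.append(
                    (r, c, "Shaded cell has more than one adjacent shaded cell")
                )
            elif neighbors == 0 and final:
                violations.append((r, c, "Shaded cell has no adjacent shaded cell"))
    return violations


# Hypothetical 3x3 board: one complete pair in the top row, one lone shaded cell.
board = [
    "##.",
    "..#",
    "...",
]
violations = check_shaded_adjacency(board, final=True)
```

Because each violation carries a cell coordinate and a rule message, a caller could return it to a model as targeted feedback in agentic mode, or convert it into a per-move penalty. Both uses are sketched here only as possibilities consistent with the abstract, not as the paper's implementation.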

Paper Structure

This paper contains 75 sections, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Model success rates by strategy (left), with a 3$\times$3 gallery of partial and complete solves across diverse attempts (right).
  • Figure 2: A norinori puzzle showing the solving process. Red highlights in (b) indicate cells where variety-specific constraints are violated; the verifier reports which rule is broken (e.g., "Shaded cell has no adjacent shaded cell"), enabling targeted feedback in agentic mode.
  • Figure 3: GPT-5.2 outcome breakdown by reasoning effort level (direct ask, 300 puzzles). Each bar sums to 100%. Correct answers (blue) improve 81$\times$ from none to xhigh, but at xhigh, 35% of requests fail before returning a response (red), revealing a sharp reliability/capability tradeoff.
  • Figure 4: Model success rates over time with frontier model release dates annotated. The progression shows both generational improvement within model families and the gap between frontier and non-frontier models.
  • Figure 5: Cost per success vs. success rate across models. The Pareto frontier (lower-right) shows the cost-efficiency tradeoff: Grok 4.1 Fast achieves the lowest cost per success ($0.01) while GPT-5.2@xhigh achieves the highest success rate (56.0%).
  • ...and 2 more figures