TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O'Connor, Fergal Reid

Abstract

Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns, such as premature commitment and constraint forgetting, directly reduce the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally, we study mitigation strategies including prompt guidance, cell-aligned grid representations, and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.

Paper Structure (46 sections, 15 figures, 22 tables)

Figures (15)

  • Figure 1: The six TopoBench puzzle families organized by the global spatial constraint each targets: path connectivity (Flow Free), network connectivity (Bridges), loop closure (Loopy), region partitioning under rotational symmetry (Galaxies), visibility through reflection (Undead), and contiguity across intersecting axes (Pattern).
  • Figure 2: Prevalence of the seven main error categories among incorrect traces, pooled across difficulty tiers: all five puzzle types ($n{=}455$, left), Bridges ($n{=}45$, center), and Undead ($n{=}84$, right). ES dominates the aggregate and Undead panels as a downstream symptom of failure. STF is the leading category on Bridges (47%), while PC (37%) and RD (43%) are prominent on Undead. CF is the rarest category across all panels. Four additional low-frequency categories are reported in Appendix \ref{app:full_error_taxonomy}.
  • Figure 3: Intervention effects on Bridges (circles, $N{=}300$) and Undead (squares, $N{=}300$). Points show total accuracy with 95% Wilson CIs; shaded bands mark the baseline. PC and CF produce large, significant drops on both puzzles; STF reaches significance only on Undead; RR variants are indistinguishable from baseline.
  • Figure 4: Tokenization of a Bridges puzzle with GPT-5-mini's tokenizer (tiktoken). ASCII (left) produces ragged token boundaries that straddle grid cells; IntFormat (center) and IntFormat-JSON (right) yield uniform, cell-aligned tokens that preserve board structure.
  • Figure 5: Spearman rank correlations between model performance on TopoBench, existing puzzle benchmarks (KORGym, Enigmata), and general reasoning benchmarks (ARC-AGI-1/2, AIME 2025, AA Intelligence). All puzzle benchmarks correlate with existing benchmarks.
  • ...and 10 more figures
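
The Figure 3 caption reports accuracies with 95% Wilson confidence intervals. As a minimal sketch of how such an interval can be computed from a solve count, the snippet below implements the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example counts are illustrative assumptions, not taken from the paper or its code; only the sample size of 300 comes from the caption.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")

    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n

    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 150 of 300 puzzles solved (counts are made up).
low, high = wilson_interval(150, 300)
print(f"accuracy 0.500, 95% Wilson CI [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when accuracy is near 0 or 1, which matters for puzzle families that are nearly unsolved.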