MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng

TL;DR

MathSticks introduces a Visual Symbolic Compositional Reasoning benchmark using matchstick arithmetic to jointly assess perception, symbolic editing under strict constraints, and arithmetic verification. The authors present a large-scale, procedurally generated dataset (~1.41M solvable instances) with two evaluation regimes (text-prompted and pure-visual), plus a 400-item test set and diagnostic labels. Evaluations across 14 vision–language models reveal a clear gap between state-of-the-art closed models and humans, with open-source models performing near chance in the visual regime, highlighting the need for targeted training and architectural innovations for VSCR. The work provides a reproducible, fine-grained diagnostic framework and releases code and data to advance future research in visual-symbolic reasoning.

Abstract

We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision–language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
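
To make the task concrete, the following is a minimal sketch of a one-move search over a single-digit equation of the form a op b = c. It is an illustration only, not the authors' generator or solver. Several things here are assumptions: the conventional a–g seven-segment labels (the benchmark itself indexes segments 0–6, and that mapping is not reproduced here), the hypothetical function name `one_move_solutions`, the restriction to single-digit operands, and a fixed operator (the benchmark also covers two-move puzzles and operator changes, which this sketch omits).

```python
# Conventional seven-segment encoding (labels a-g are an assumption; the
# benchmark uses its own 0-6 segment indices).
SEGMENTS = {
    0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
    5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
}
DIGIT_TO_SEGS = {d: frozenset(s) for d, s in SEGMENTS.items()}
SEGS_TO_DIGIT = {s: d for d, s in DIGIT_TO_SEGS.items()}
ALL_SEGS = frozenset("abcdefg")


def one_move_solutions(a, b, c, op="+"):
    """Return every (a', b', c') reachable by moving exactly one stick
    between or within the three digit slots such that a' op b' == c'.

    Hypothetical helper: the operator and the equals sign stay fixed, and
    the total number of sticks is conserved by construction.
    """
    slots = [DIGIT_TO_SEGS[a], DIGIT_TO_SEGS[b], DIGIT_TO_SEGS[c]]
    results = set()
    for src in range(3):
        for seg_out in slots[src]:
            # Pick up one stick from the source digit.
            removed = list(slots)
            removed[src] = removed[src] - {seg_out}
            for dst in range(3):
                # Put it down on any empty segment of the destination digit.
                for seg_in in ALL_SEGS - removed[dst]:
                    if src == dst and seg_in == seg_out:
                        continue  # putting the stick back is not a move
                    new = list(removed)
                    new[dst] = new[dst] | {seg_in}
                    # Every slot must still render a valid digit.
                    if not all(s in SEGS_TO_DIGIT for s in new):
                        continue
                    x, y, z = (SEGS_TO_DIGIT[s] for s in new)
                    value = x + y if op == "+" else x - y
                    if value == z:
                        results.add((x, y, z))
    return sorted(results)


if __name__ == "__main__":
    # 6 + 4 = 4 is incorrect; moving the middle stick of the 6 gives 0 + 4 = 4.
    print(one_move_solutions(6, 4, 4))
```

Under this encoding, the incorrect equation 6 + 4 = 4 has the single one-move fix 0 + 4 = 4. Returning a set of solutions rather than the first hit mirrors the abstract's point that puzzles can differ in solution multiplicity.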

Paper Structure

This paper contains 38 sections, 2 equations, 22 figures, 6 tables, 2 algorithms.

Figures (22)

  • Figure 1: Overview of the MathSticks task.
  • Figure 2: Illustration of the segment-level indexing scheme. Each digit position in the equation (indexed sequentially from left to right) is decomposed into seven labeled segments (0–6).
  • Figure 3: Example of the template library, showing digit slots with indexed segments and the operator slot. Each index corresponds to a movable matchstick.
  • Figure 4: Dataset distribution. (a) Proportions across difficulty levels. (b) Decomposition by move complexity, solution multiplicity, and operator flipping.
  • Figure 5: Prompt with text input. The symbolic equation string is provided together with the matchstick rendering.
  • ...and 17 more figures