SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

TL;DR

Sphinx introduces a synthetic, ground-truth-driven environment for visual perception and reasoning, aiming to diagnose core primitives such as symmetry, transformation, and spatial relations. It modularizes data generation through motifs, tilings, and task classes, enabling 25 task types across five families and scalable dataset creation of $2{,}500$ questions with exact ground-truth solutions. Evaluation shows current LVLMs (e.g., GPT-5) achieve around $51.1\%$ accuracy, well below human performance, while reinforcement learning with verifiable rewards (RLVR) yields consistent IID and some OOD gains and improves generalization to external benchmarks. The work demonstrates the value of verifiable supervision for multimodal reasoning, and the authors plan an open-source release to foster broader adoption and extension of the framework.

Abstract

We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
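To make the idea of procedurally generated puzzles with verifiable ground truth concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy symmetry-detection task on a binary grid; the function names `make_symmetry_puzzle` and `verifiable_reward`, the grid representation, and the yes/no answer format are all hypothetical choices made for illustration.

```python
import random


def make_symmetry_puzzle(size=6, seed=0):
    """Hypothetical generator: a binary grid plus the exact answer to
    'is this grid left-right mirror symmetric?'."""
    rng = random.Random(seed)  # seeded so every puzzle is reproducible
    grid = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    if rng.random() < 0.5:
        # Force mirror symmetry by copying the left half onto the right half.
        for row in grid:
            for c in range(size // 2):
                row[size - 1 - c] = row[c]
    # The ground truth is computed from the final grid itself, so it stays
    # correct even if an unforced random grid happens to be symmetric.
    answer = "yes" if all(row == row[::-1] for row in grid) else "no"
    return {
        "grid": grid,
        "question": "Is the grid left-right mirror symmetric? Answer yes or no.",
        "answer": answer,
    }


def verifiable_reward(puzzle, model_output):
    """Hypothetical binary reward: 1.0 on an exact match with the ground truth."""
    return 1.0 if model_output.strip().lower() == puzzle["answer"] else 0.0


if __name__ == "__main__":
    puzzle = make_symmetry_puzzle(size=6, seed=3)
    for row in puzzle["grid"]:
        print("".join("#" if cell else "." for cell in row))
    print(puzzle["question"], "->", puzzle["answer"])
```

Because the answer is derived programmatically from the generated instance, the same check can serve both as an exact evaluation metric and as a reward signal of the kind RLVR relies on. Real Sphinx tasks render images from motifs and tilings rather than text grids.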

Paper Structure

This paper contains 86 sections, 1 equation, 37 figures, 4 tables.

Figures (37)

  • Figure 1: Radar plot shows accuracies (%) achieved by LVLMs and by humans on the broad categories of Sphinx.
  • Figure 2: Example Motifs (from left): Crescent, Glyph, Pinwheel, Polygon, Polyomino and Icons
  • Figure 3: Example Tilings (from left): circles, square, triangular, hexagonal, rhombille.
  • Figure 4: Sphinx task illustrations
  • Figure 5: Familiarity vs Accuracy - Human Evaluators
  • ...and 32 more figures