SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi
TL;DR
Sphinx is a synthetic, ground-truth-driven environment for visual perception and reasoning, designed to diagnose core primitives such as symmetry, transformation, and spatial relations. It modularizes data generation through motifs, tilings, and task classes, supporting 25 task types across five families and scalable creation of a dataset of 2,500 questions with exact ground-truth solutions. Evaluation shows that current LVLMs (e.g., GPT-5) reach only about 51.1% accuracy, well below human performance, while reinforcement learning with verifiable rewards (RLVR) yields consistent in-distribution (IID) gains, some out-of-distribution (OOD) gains, and improved generalization to external benchmarks. The work demonstrates the value of verifiable supervision for multimodal reasoning, and the authors plan an open-source release to foster broader adoption and extension of the framework.
Abstract
We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
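To make the idea of procedurally generated puzzles with verifiable ground truth concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy symmetry-detection task on a binary grid; the names `make_symmetry_puzzle`, `is_mirror_symmetric`, and `verifiable_reward` are hypothetical and do not come from Sphinx. The answer is computed from the generated grid itself, so a model response can be scored by exact match, which is the kind of reward signal RLVR relies on.

```python
import random


def is_mirror_symmetric(grid):
    """Return True if every row reads the same left-to-right and right-to-left."""
    return all(row == row[::-1] for row in grid)


def make_symmetry_puzzle(size=4, seed=0):
    """Generate a toy puzzle: a binary grid plus its exact ground-truth answer.

    Hypothetical helper for illustration only; requires size >= 2.
    """
    rng = random.Random(seed)
    half = (size + 1) // 2
    grid = []
    for _ in range(size):
        left = [rng.randint(0, 1) for _ in range(half)]
        # Mirror the left half so that the row is symmetric by construction.
        grid.append(left + left[: size - half][::-1])

    # With probability 0.5, break the symmetry by flipping one off-center cell.
    if rng.random() < 0.5:
        r = rng.randrange(size)
        c = rng.randrange(size // 2)
        grid[r][c] ^= 1

    # The ground truth is derived from the final grid, not from the sampling path.
    answer = "yes" if is_mirror_symmetric(grid) else "no"
    question = "Is the grid left-right mirror symmetric? Answer yes or no."
    return {"grid": grid, "question": question, "answer": answer}


def verifiable_reward(response, answer):
    """Binary reward: 1.0 for an exact match after normalization, else 0.0."""
    return 1.0 if response.strip().lower() == answer else 0.0


if __name__ == "__main__":
    puzzle = make_symmetry_puzzle(size=4, seed=1)
    for row in puzzle["grid"]:
        print(row)
    print(puzzle["question"], "->", puzzle["answer"])
    print("reward for 'Yes':", verifiable_reward("Yes", puzzle["answer"]))
```

Because the label is recomputed from the grid, the generator cannot emit a puzzle whose stored answer disagrees with its content, which is what makes both evaluation and reward computation exact in this toy setting.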
