Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

TL;DR

This work tackles the lack of generalizable spatial intelligence in vision-language models by treating Euclidean geometry problem solving as a surrogate task. It introduces Euclid30K, a large multimodal geometry dataset, and uses GRPO-based reinforcement learning to fine-tune seven model variants across 3–72B parameters. Results show consistent zero-shot improvements across four spatial benchmarks (Super-CLEVR, Omni3D-Bench, VSI-Bench, MindCube), with notable gains in VSI-Bench and MindCube, supporting the hypothesis that Euclidean priors enable transfer to diverse spatial tasks. The study also provides an educational-psychology rationale and rigorous ablations demonstrating that geometry priors generalize beyond task-specific data, offering a principled route to enhance spatial perception in multimodal models.

Abstract

Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3–72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3D-Bench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2 percentage points), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7 percentage points). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can endow vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
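
The abstract names GRPO as the fine-tuning method but does not describe its mechanics. As a rough illustration only, the group-relative step that gives GRPO its name can be sketched as follows: several responses are sampled for the same problem, each is scored, and every score is normalised against the mean and standard deviation of its own group. The function name, the binary correctness reward, the use of the population standard deviation, and the `eps` value are assumptions made for this sketch; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own sampling group.

    `rewards` holds one scalar score per response sampled for a single
    prompt. Returns one advantage per response: (r - mean) / (std + eps).
    Assumption: population standard deviation; implementations differ.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled answers to one geometry problem,
# scored 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, correct answers receive positive advantages and incorrect ones negative advantages, and a group in which all answers score the same yields all-zero advantages, so such a group contributes no learning signal.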

Paper Structure

This paper contains 30 sections, 12 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Performance gains on VSI-Bench after model training on Euclid30K. For more complete data, please refer to the VSI-Bench results table.
  • Figure 2: Examples of the newly collected questions in Euclid30K. More examples can be found in the appendix.
  • Figure 3: Enhancing spatial perception and reasoning capabilities in models using the geometric problem-solving dataset (Euclid30K).
  • Figure 4: Performance improvement on Super-CLEVR, Omni3D-Bench, VSI-Bench, and MindCube after the model has been trained on Euclid30K.
  • Figure 5: Responses and final answers of Qwen2.5VL-7B and Qwen2.5VL-Euclid-7B on Omni3D-Bench.
  • ...and 10 more figures