SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization

Peiyao Wang, Haibin Ling

TL;DR

This work tackles the weak spatial reasoning of open-source vision-language models by introducing SVQA-R1, a reinforcement learning framework that enforces view-consistent spatial understanding. It combines mirror-based QA augmentation with a mixed reward design (a format-based component plus a semantic-based component) in Spatial-GRPO, a policy optimization objective that uses both original and mirrored views for robust reasoning. The approach yields substantial gains on open-ended Spatial VQA and numerical spatial tasks, outperforming many open-source baselines and approaching the performance of advanced LLM-driven systems, while producing interpretable, step-wise reasoning paths. The results indicate that view-consistent RL can significantly enhance grounded spatial reasoning in multimodal models with limited supervised data.

Abstract

Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning (SFT) data. Extensive experiments and visualization demonstrate the effectiveness of SVQA-R1 across multiple spatial reasoning benchmarks.
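The mirror-flipping perturbation mentioned in the abstract can be illustrated with a minimal sketch. It assumes an image stored as a list of pixel rows and a question-answer pair in plain English; the helper names (`mirror_text`, `mirror_image`, `mirror_sample`) are hypothetical and are not taken from the paper, whose actual augmentation and answer-verification pipeline is not shown in this summary.

```python
import re

# Hypothetical sketch: swap "left"/"right" in the QA text when the image is mirrored.
_SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", flags=re.IGNORECASE)


def _swap_word(match):
    """Return the opposite direction word, preserving the original casing."""
    word = match.group(0)
    swapped = _SWAP[word.lower()]
    if word.isupper():
        return swapped.upper()
    if word[0].isupper():
        return swapped.capitalize()
    return swapped


def mirror_text(text):
    """Swap every standalone 'left'/'right' token.

    Limitation (assumption of this sketch): 'right' used in the sense of
    'correct' is also swapped, and compounds such as 'leftmost' are not.
    """
    return _PATTERN.sub(_swap_word, text)


def mirror_image(pixels):
    """Flip an image (list of rows) horizontally by reversing each row."""
    return [list(reversed(row)) for row in pixels]


def mirror_sample(pixels, question, answer):
    """Build the mirrored counterpart of one (image, question, answer) sample."""
    return mirror_image(pixels), mirror_text(question), mirror_text(answer)
```

Applying `mirror_text` twice returns the original string, which matches the intuition that a view-consistent model should give mutually mirrored answers for the two views. A purely lexical swap like this one is only a starting point; answers that do not depend on left-right direction (for example, distances) would stay unchanged.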

Paper Structure

This paper contains 39 sections, 5 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: (a) The original image, question, and answer of a sample. (b) The flipped image, question, and the answers before and after verification enhancement.
  • Figure 2: A diverse set of open-ended spatial question-answer types.
  • Figure 3: Visualization of different open-ended spatial question-answer types.
  • Figure 4: Visualization of original and flipped image and QA for left-right spatial reasoning.
  • Figure 5: Visualization of original and flipped image and QA for bounding box consistency.
  • ...and 4 more figures
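The bounding box consistency named in the Figure 5 caption can be sketched as a simple geometric check. This sketch assumes boxes in pixel coordinates as `(x_min, y_min, x_max, y_max)` and a known image width; the function names and the tolerance parameter are hypothetical, and the paper's own consistency criterion is not given in this summary.

```python
def flip_box_horizontally(box, image_width):
    """Map a box to its position in the horizontally mirrored image.

    Under a horizontal flip, x becomes (image_width - x), so the old right
    edge becomes the new left edge and vice versa; y is unchanged.
    """
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def boxes_consistent(box_original, box_flipped, image_width, tol=1.0):
    """Check whether a box predicted on the flipped view matches the mirrored
    version of the box predicted on the original view, within `tol` pixels.

    `tol` is an illustrative assumption, not a value from the paper.
    """
    expected = flip_box_horizontally(box_original, image_width)
    return all(abs(e - p) <= tol for e, p in zip(expected, box_flipped))
```

For example, in an image 100 pixels wide, the box `(10, 20, 50, 80)` maps to `(50, 20, 90, 80)` after mirroring, and flipping twice recovers the original box.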