Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou

TL;DR

This work tackles the challenge of fine-grained spatial reasoning in vision-language models by introducing SpatialReasoner-R1, a LongCoT-capable VLM trained with fine-grained Direct Preference Optimization (fDPO). A data-generation pipeline, Multi-Model Monte Carlo Tree Search (M3CTS), produces diverse, logically consistent reasoning trajectories, guided by a fine-grained spatial reward suite that evaluates descriptive grounding, spatial accuracy, and logical coherence. Empirical results show that fDPO yields consistent improvements over standard DPO on spatial quality (+4.1%) and spatial quantity (+9.0%), while SpatialReasoner-R1 with fDPO achieves state-of-the-art performance on SPATIALRGPT-Bench (+9.8% average accuracy over the strongest baseline) and maintains strong general vision-language capabilities. The combination of segment-wise optimization and reward-guided data generation advances robust, interpretable spatial reasoning in VLMs, with potential impact on robotics, AR/VR, and embodied AI; future work extends to GUI navigation and 3D/embodied reasoning.

Abstract

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
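
To make the idea of segment-specific preference granularity concrete, the sketch below applies the standard DPO loss, -log σ(β · margin), separately to a description segment and a reasoning segment, each with its own β. This is only an illustration under stated assumptions: the segment names, the β values, and the plain sum over segments are hypothetical choices, not the paper's actual fDPO objective.

```python
import math


def dpo_loss(margin: float, beta: float) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    margin is the policy-vs-reference log-probability advantage of the chosen
    response over the rejected one:
        (logp_policy(chosen) - logp_ref(chosen))
      - (logp_policy(rejected) - logp_ref(rejected))
    """
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def segmentwise_dpo_loss(margins: dict[str, float], betas: dict[str, float]) -> float:
    """Sum per-segment DPO losses, each with its own beta.

    Assumption: segments are combined by an unweighted sum; the paper may
    combine or weight them differently.
    """
    return sum(dpo_loss(margins[name], betas[name]) for name in margins)


# Hypothetical values: a stronger preference signal on the reasoning segment.
example_margins = {"description": 0.5, "reasoning": 2.0}
example_betas = {"description": 0.1, "reasoning": 0.3}
example_loss = segmentwise_dpo_loss(example_margins, example_betas)
```

Giving each segment its own β lets the optimizer weight descriptive grounding and logical reasoning differently instead of treating the whole response as a single unit, which is the intuition the abstract attributes to fDPO.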

Paper Structure

This paper contains 27 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Method Overview. To train SpatialReasoner-R1, we (1) generate reasoning paths using M3CTS, (2) construct fine-grained preference pairs via reward-based selection, and (3) train with fine-grained DPO (fDPO) to optimize descriptive and logical reasoning separately.
  • Figure 2: Architecture Overview. SpatialReasoner-R1 is a VLM that takes as input a text instruction, visual prompts, and an image, and generates LongCoT reasoning responses.
  • Figure 3: Fine-Grained Spatial Rewards. Candidate reasoning paths are decomposed into three aspects (descriptive, spatial, and reasoning) that are scored separately, with the higher and lower value in each row marked. Explanation of Scoring: Descriptive: the negative response omits the two bar stools and uses generic “modern kitchen” wording, whereas the positive response lists every salient object. Spatial: the negative response wrongly claims the island is lower than the rear counter and ignores the 20cm offset revealed by the stool reference, whereas the positive response ties its estimate to the 75cm stool height plus that offset. Reasoning: the negative response uses an illogical "half-height" heuristic $90\text{cm} \rightarrow 45\text{cm}$ without intermediate computation, whereas the positive response explicitly adds the reference height and the gap (75cm + 20cm = 95cm). These per-category deficits yield a lower composite reward, designating the upper response as the negative sample.
  • Figure 4: Qualitative Examples of Spatial Reasoning Across Models. SpatialReasoner-R1 demonstrates a coherent, multi-step logical chain that closely matches the ground truth, while other models exhibit less precise or less interpretable reasoning paths.
  • Figure 5: Example Reasoning Tree from the M3CTS Data Generation Pipeline. Diverse candidate reasoning paths are sampled from multiple models. Each path follows a structured LongCoT format with markdown-style section headers that decompose the answer into interpretable reasoning stages.
  • ...and 6 more figures
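
The reward-based selection described in the Figure 1 and Figure 3 captions can be sketched as follows: each candidate path receives descriptive, spatial, and reasoning scores, and the candidates with the highest and lowest composite reward form a preference pair. This is a minimal sketch under stated assumptions: the `ScoredPath` type, the unweighted sum used as the composite, and all numeric scores are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ScoredPath:
    """A candidate reasoning path with per-aspect rewards (hypothetical type)."""
    text: str
    descriptive: float
    spatial: float
    reasoning: float

    def composite(self) -> float:
        # Assumption: the composite reward is an unweighted sum of the aspects.
        return self.descriptive + self.spatial + self.reasoning


def select_preference_pair(paths: list[ScoredPath]) -> tuple[ScoredPath, ScoredPath]:
    """Return (positive, negative): the highest- and lowest-scoring candidates."""
    if len(paths) < 2:
        raise ValueError("need at least two candidate paths to form a pair")
    ranked = sorted(paths, key=lambda p: p.composite())
    return ranked[-1], ranked[0]


# Illustrative candidates modeled on the Figure 3 example; scores are invented.
candidates = [
    ScoredPath("half-height heuristic: 90cm -> 45cm", 0.4, 0.2, 0.1),
    ScoredPath("stool height 75cm + 20cm offset = 95cm", 0.9, 0.8, 0.9),
]
positive, negative = select_preference_pair(candidates)
```

Keeping the three scores separate until the final ranking makes it easy to see which aspect caused a candidate to lose, which mirrors the per-category explanation in the Figure 3 caption.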