Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
TL;DR
This work tackles fine-grained spatial reasoning in vision-language models by introducing SpatialReasoner-R1, a LongCoT-capable VLM trained with fine-grained Direct Preference Optimization (fDPO). A data-generation pipeline, Multi-Model Monte Carlo Tree Search (M3CTS), produces diverse, logically consistent reasoning trajectories, guided by a fine-grained spatial reward suite that evaluates descriptive grounding, spatial accuracy, and logical coherence. Empirical results show that fDPO yields consistent improvements over standard DPO on spatial quality tasks (+4.1% on average) and spatial quantity tasks (+9.0%), while SpatialReasoner-R1 trained with fDPO achieves state-of-the-art performance on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, and maintains strong general vision-language capabilities. The combination of segment-wise optimization and reward-guided data generation advances robust, interpretable spatial reasoning in VLMs, with potential impact on robotics, AR/VR, and embodied AI. Future work may extend to GUI navigation and 3D/embodied reasoning.
Abstract
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
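To make the idea of segment-specific preference granularity concrete, the following is a minimal sketch of how a segment-wise DPO objective could be written. It is not taken from the paper: the function names, the two segment labels ("description" and "reasoning"), the per-segment weights `betas`, and the choice to combine weighted segment margins inside a single log-sigmoid are all illustrative assumptions. Standard DPO is recovered when both segments share the same weight.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_margin(policy_chosen, ref_chosen, policy_rejected, ref_rejected):
    """Log-ratio margin for one segment: how much more the policy prefers the
    chosen segment over the rejected one, relative to the reference model."""
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)


def segmentwise_dpo_loss(segments, betas):
    """Hypothetical segment-wise DPO loss for a single preference pair.

    segments: maps a segment name to its four summed log-probabilities.
    betas:    maps a segment name to its preference strength (assumed values).
    """
    weighted_margin = sum(
        betas[name] * segment_margin(**logps) for name, logps in segments.items()
    )
    # -log(sigmoid(m)) is equal to softplus(-m)
    return softplus(-weighted_margin)


# Illustrative log-probabilities only; these are not results from the paper.
example = {
    "description": dict(policy_chosen=-12.0, ref_chosen=-13.0,
                        policy_rejected=-15.0, ref_rejected=-14.0),
    "reasoning": dict(policy_chosen=-30.0, ref_chosen=-31.0,
                      policy_rejected=-36.0, ref_rejected=-33.0),
}
loss = segmentwise_dpo_loss(example, {"description": 0.1, "reasoning": 0.2})
```

In this sketch, giving the reasoning segment a larger weight than the description segment means that preference differences in the logical steps move the loss more than differences in the descriptive grounding. How the weights are actually set in fDPO is not specified in the abstract.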
