Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang; Meng Cao; Ruyang Liu; Xiaoxi Liang; Linglong Li; Ge Li; Xiaodan Liang

TL;DR

This work tackles robust video spatial reasoning in 3D scenes by addressing query-focused biases in multimodal models. The authors propose Object-Centric 3D Rollout (OCR), which perturbs 3D object regions during training and projects changes to 2D, forcing the model to reason over global scene context. They couple OCR with a GRPO-based rollout framework and create OCR-SFT and OCR-RL datasets to enable both supervised and reinforcement-learning training, achieving a VSI-Bench average of 47.5% with a 3B model, outperforming several 7B baselines. Extensive ablations and visualizations demonstrate the superiority of geometry-aware, object-centric perturbations over prior rollout strategies, highlighting the importance of holistic, context-grounded spatial reasoning in video understanding.

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
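
The core idea of perturbing an object's 3D geometry and projecting the result into 2D can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the perturbation, so the sketch assumes zero-mean Gaussian noise on camera-frame points and a simple pinhole camera. All function names and parameters (`perturb_points`, `project_point`, `region_mask`, `fx`, `fy`, `cx`, `cy`) are hypothetical.

```python
import random


def perturb_points(points, noise_std, rng):
    """Add zero-mean Gaussian noise to every coordinate of each 3D point.

    Assumption: Gaussian noise stands in for the paper's unspecified perturbation.
    """
    return [tuple(c + rng.gauss(0.0, noise_std) for c in p) for p in points]


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of one camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def region_mask(points, fx, fy, cx, cy, height, width):
    """Return the set of (row, col) pixels hit by the projected points.

    Points behind the camera or outside the frame are skipped.
    """
    covered = set()
    for p in points:
        if p[2] <= 0:  # behind the camera: cannot be projected
            continue
        u, v = project_point(p, fx, fy, cx, cy)
        row, col = round(v), round(u)
        if 0 <= row < height and 0 <= col < width:
            covered.add((row, col))
    return covered


# Hypothetical object: 200 points in a small box roughly 2 m in front of the camera.
rng = random.Random(0)
object_points = [
    (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), rng.uniform(2.0, 2.4))
    for _ in range(200)
]
noisy_points = perturb_points(object_points, noise_std=0.02, rng=rng)
noisy_region = region_mask(noisy_points, 500.0, 500.0, 320.0, 240.0, 480, 640)
```

In a pipeline of this shape, `noisy_region` would mark the frame pixels where the selected object's visual cues are degraded, while the rest of the frame stays untouched. The model then has to rely on the surrounding scene context rather than on the queried object alone.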
