Video Spatial Reasoning with Object-Centric 3D Rollout

Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang

TL;DR

This work tackles robust video spatial reasoning in 3D scenes by addressing query-focused biases in multimodal models. The authors propose Object-Centric 3D Rollout (OCR), which perturbs 3D object regions during training and projects changes to 2D, forcing the model to reason over global scene context. They couple OCR with a GRPO-based rollout framework and create OCR-SFT and OCR-RL datasets to enable both supervised and reinforcement-learning training, achieving a VSI-Bench average of 47.5% with a 3B model, outperforming several 7B baselines. Extensive ablations and visualizations demonstrate the superiority of geometry-aware, object-centric perturbations over prior rollout strategies, highlighting the importance of holistic, context-grounded spatial reasoning in video understanding.

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning (the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes) remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
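The core idea of perturbing an object's 3D geometry and projecting the result into 2D can be illustrated with a small sketch. This is not the authors' implementation: the Gaussian jitter, the pinhole camera model, and the names `perturb_object_points` and `project_pinhole` are all assumptions made for illustration, since the abstract does not specify the noise type or the projection used.

```python
import random

def perturb_object_points(points, selected, sigma, rng):
    """Jitter only the 3D points flagged as belonging to the selected object.

    Assumption: isotropic Gaussian noise; the paper's perturbation may differ.
    """
    noisy = []
    for (x, y, z), is_selected in zip(points, selected):
        if is_selected:
            noisy.append((x + rng.gauss(0.0, sigma),
                          y + rng.gauss(0.0, sigma),
                          z + rng.gauss(0.0, sigma)))
        else:
            noisy.append((x, y, z))  # context points stay untouched
    return noisy

def project_pinhole(points, fx, fy, cx, cy):
    """Project camera-frame 3D points to pixels with a pinhole model.

    Points at or behind the camera plane (z <= 0) map to None.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:
            pixels.append(None)
        else:
            pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels

# Hypothetical scene: three camera-frame points, the middle one on the selected object.
rng = random.Random(0)
scene = [(0.0, 0.0, 2.0), (0.5, -0.2, 4.0), (1.0, 1.0, 5.0)]
selected = [False, True, False]

noisy_scene = perturb_object_points(scene, selected, sigma=0.1, rng=rng)
clean_px = project_pinhole(scene, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
noisy_px = project_pinhole(noisy_scene, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In this sketch only the pixels belonging to the selected object move between `clean_px` and `noisy_px`, which mirrors the stated goal of degrading object-specific cues while leaving the surrounding context intact.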

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Motivation of our proposed OCR. The previous method exhibits query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. Our method encourages MLLMs to rely on surrounding context and inter-object relationships for comprehensive spatial reasoning.
  • Figure 2: Overview of our Object-Centric 3D Rollout (OCR) strategy. Given a video scene, we generate additional rollouts by injecting spatial noise into selected object regions using a step-wise object selector. These perturbed rollouts are used for advantage estimation, while only the clean rollouts (outlined in black) contribute to gradient updates.
  • Figure 3: Distribution of the constructed dataset. The chart illustrates the category-wise distribution of our data. Light blue denotes the training set, while blue indicates the curated cold-start subset. Categories are sorted in descending order based on their frequency in the RL data.
  • Figure 4: Visualization samples of our method. Text in green indicates instances where the model incorporates additional relevant objects beyond those mentioned in the query, enhancing spatial reasoning. Text in blue reflects the model's ability to reason from a global scene perspective rather than relying solely on local visual cues.
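
The Figure 2 caption states that perturbed rollouts feed advantage estimation while only clean rollouts contribute to gradient updates. A minimal sketch of that split follows, assuming the standard GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names, the `eps` constant, and the simplified sequence-level surrogate are illustrative assumptions; clipping, KL regularization, and token-level terms are omitted, and the paper's exact objective may differ.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) over one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clean_only_surrogate(clean_rewards, noisy_rewards, clean_logps):
    """Normalize over clean + noisy rollouts, but build the loss from clean ones only.

    clean_logps: sequence-level log-probabilities of the clean rollouts
    (a simplification; a real trainer would work per token).
    """
    rewards = list(clean_rewards) + list(noisy_rewards)
    advantages = grpo_advantages(rewards)
    clean_adv = advantages[:len(clean_rewards)]  # noisy rollouts are dropped here
    loss = -sum(a * lp for a, lp in zip(clean_adv, clean_logps)) / len(clean_logps)
    return clean_adv, loss

# Hypothetical group: two clean rollouts and two region-noisy rollouts.
clean_adv, loss = clean_only_surrogate(
    clean_rewards=[1.0, 0.0],
    noisy_rewards=[0.0, 0.0],
    clean_logps=[-0.5, -1.0],
)
```

Under these assumptions the noisy rollouts shift the group mean and standard deviation, so a clean rollout that succeeds where the perturbed ones fail receives a larger advantage, even though no gradient flows through the noisy samples themselves.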