Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Yibin Huang; Wang Xu; Wanyue Zhang; Helu Zhi; Jingjing Huang; Yangbin Xu; Yangang Sun; Conghui Zhu; Tiejun Zhao

TL;DR

This work addresses the challenge of spatial reasoning in multimodal models by replacing coarse grid-based cognitive maps with a metric-grounded BEV layout learned from video. It introduces Video2Layout, a two-stage training framework (SFT on synthetic data, followed by RL with GRPO) that grounds object coordinates and enables structured numerical reasoning, supported by the V2LO-28K dataset and the QVS-Bench diagnostic benchmark. Empirically, V2LO-7B achieves an average improvement of 4.92% over grid-map baselines across mainstream spatial benchmarks, with robust sim-to-real generalization and evidence that 4–8 input frames yield the best cognitive-map accuracy. These findings advance precise geometric reasoning in embodied AI and provide actionable benchmarks for future research in metric-grounded spatial cognition.
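As a rough illustration of the RL stage named above, GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step. The function name, the `eps` constant, the choice of population standard deviation, and the binary rewards are all assumptions made for illustration; the reward design actually used by Video2Layout is not specified in this summary.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is the reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one spatial question; 1.0 = correct, 0.0 = wrong
# (illustrative rewards only).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to produce a baseline.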

Abstract

Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding from multi-frame visual inputs via grid-based cognitive maps. However, these methods rely on discretized raster representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework uses continuous object boundary coordinates to quantify inter-object physical distances and object sizes. This gives the model quantitative spatial computation capabilities and alleviates the inherent ambiguity of describing spatial relationships in natural language. Our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Second, a reinforcement fine-tuning stage further enhances the model's real-world generalization. To systematically evaluate how the number of input images affects both cognitive map accuracy and spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the underlying mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
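To make the contrast between a rasterized grid map and continuous boundary coordinates concrete, the following minimal sketch computes object size and inter-object distance from axis-aligned bird's-eye-view boxes. The `Box` class, its field names, the metre units, the 1 m grid cell, and the example objects are all hypothetical; the layout format actually produced by Video2Layout may differ.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Box:
    """Axis-aligned BEV footprint in metres (hypothetical layout format)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def size(self):
        # (width, depth) read directly from the boundary coordinates
        return (self.x_max - self.x_min, self.y_max - self.y_min)

    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def center_distance(a, b):
    """Euclidean distance between box centres."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(ax - bx, ay - by)


def edge_gap(a, b):
    """Smallest gap between two boxes; 0.0 if they touch or overlap."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def grid_cell(box, cell=1.0):
    """Cell index a coarse grid map would assign to the box centre."""
    cx, cy = box.center()
    return (int(cx // cell), int(cy // cell))


sofa = Box("sofa", 0.0, 0.0, 2.0, 1.0)
table = Box("table", 3.0, 0.0, 4.0, 1.0)
lamp = Box("lamp", 3.5, 0.25, 4.0, 0.75)

print(sofa.size())                          # (2.0, 1.0)
print(center_distance(sofa, table))         # 2.5
print(edge_gap(sofa, table))                # 1.0
print(grid_cell(table), grid_cell(lamp))    # (3, 0) (3, 0)
```

In this toy layout the table and the lamp fall into the same grid cell, so a grid map cannot distinguish their sizes or their distances to the sofa (2.5 m versus 2.75 m between centres), while the boundary coordinates preserve both.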
