Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao

TL;DR

This work addresses spatial reasoning in multimodal models by replacing coarse grid-based cognitive maps with a metric-grounded bird's-eye-view (BEV) layout learned from video. It introduces Video2Layout, a two-stage training framework (supervised fine-tuning on synthetic data, followed by reinforcement learning with GRPO) that grounds object coordinates and enables structured numerical reasoning, supported by the V2LO-28K dataset and the QVS-Bench diagnostic benchmark. V2LO-7B achieves an average improvement of 4.92% over a grid-map baseline across QVS-Bench and mainstream spatial benchmarks, with robust sim-to-real generalization and evidence that 4–8 input frames yield the best cognitive-map accuracy. These findings advance precise geometric reasoning in embodied AI and provide a diagnostic benchmark for future research in metric-grounded spatial cognition.

Abstract

Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object sizes. This gives the model quantitative spatial computation capabilities, alleviating the inherent ambiguity of describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate how the number of input images affects both cognitive map accuracy and spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
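
To make the idea of continuous boundary coordinates concrete, the following minimal sketch stores each object as an axis-aligned footprint in a BEV frame and derives its size and the distance between object centers. The `BevBox` class, its field names, the choice of metres as the unit, and the sample coordinates are all hypothetical illustrations, not the representation used in the paper or its released code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BevBox:
    """Hypothetical axis-aligned object footprint in a BEV frame (metres assumed)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def size(self):
        # Extent along x and along y, read directly from the boundaries.
        return (self.x_max - self.x_min, self.y_max - self.y_min)

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def center_distance(a, b):
    """Euclidean distance between the centers of two footprints."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(bx - ax, by - ay)


# Illustrative values only; they do not come from the paper.
sofa = BevBox("sofa", 0.0, 1.0, 2.0, 2.0)
table = BevBox("table", 3.0, 4.5, 5.0, 6.5)

print(sofa.size)                     # (2.0, 1.0)
print(center_distance(sofa, table))  # 5.0
```

A grid map would round both objects to cell indices, so the same two quantities could only be recovered up to the cell resolution.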

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of cognitive map representations. A conventional grid map (left) introduces metric and semantic inaccuracies regarding real distance, object size, and precise direction. In contrast, our method generates a metric-grounded map (right) that assigns precise Bird’s-Eye View (BEV) coordinates to objects in an observer-centered perspective, establishing a quantitative foundation for fine-grained spatial reasoning.
  • Figure 2: Overall framework of Video2Layout. (1) The data preparation stage generates QA pairs from simulated spatial, real spatial, and general-domain data sources. (2) The SFT stage trains the model on simulated spatial data and general-domain data, enabling it to generate a metric-grounded map and adopt a structured reasoning output format. (3) The RFT stage uses the GRPO algorithm to train on real-world spatial data, so that the model generalizes to real scenarios.
  • Figure 3: A detailed overview of the QVS-Bench benchmark. Left: The "Sampling Configurations" panel depicts the proportional composition of data samples across five distinct input frame counts (1, 4, 8, 12, and 16). Right: The "Task Type" panel details the proportional breakdown of the six spatial reasoning tasks, with each category accompanied by its corresponding example question template.
  • Figure 4: Illustrative examples of the structured spatial reasoning process. The structured CoT follows a unified approach: mapping objects onto a metric-grounded map to convert spatial reasoning into mathematical computation. (Left) For relative distance questions, this involves computing Euclidean distances after mapping. (Right) For complex perspective transformation questions, it further includes establishing a local coordinate system and using vector operations to determine relative orientation.
  • Figure 5: Case study (a) in a simulation scene.
  • ...and 1 more figures
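
The perspective-transformation reasoning described in the Figure 4 caption can be sketched as follows: build a local frame from the observer's facing direction, then project the observer-to-target vector onto its forward and right axes. This is only an illustration under stated assumptions. The function name `relative_direction`, the convention that BEV x points right and y points up, and the four-way direction labels are hypothetical and not taken from the paper.

```python
import math


def relative_direction(observer, facing, target):
    """Describe where `target` lies in the observer's local frame.

    Assumes a BEV frame with x to the right and y up (hypothetical convention).
    Returns a label plus the signed forward and rightward offsets.
    """
    fx, fy = facing
    norm = math.hypot(fx, fy)
    fx, fy = fx / norm, fy / norm  # unit forward axis
    rx, ry = fy, -fx               # forward rotated 90 degrees clockwise = right axis

    vx, vy = target[0] - observer[0], target[1] - observer[1]
    forward = vx * fx + vy * fy    # dot product with the forward axis
    right = vx * rx + vy * ry      # dot product with the right axis

    depth = "front" if forward >= 0 else "back"
    side = "right" if right >= 0 else "left"
    return f"{depth}-{side}", forward, right


# Observer at the origin facing +x; the target is ahead and slightly to the left.
label, forward, right = relative_direction((0.0, 0.0), (1.0, 0.0), (2.0, 1.0))
print(label, forward, right)  # front-left 2.0 -1.0
```

Because the answer comes from two dot products on metric coordinates, the same map supports questions from any viewpoint without re-describing the scene in natural language.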