SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

TL;DR

SpaceVista tackles all-scale visual spatial reasoning from mm to km by coupling an automated, real-world SpaceVista-1M dataset with a scale-aware model SpaceVista-7B. It introduces LoRA-like scale experts and a scale-router to manage cross-scale knowledge while employing a reward-driven progressive RL path to align reasoning with human-like scale understanding. The accompanying SpaceVista-Bench provides precise, measurement-based evaluation across indoor and all-scale scenes. Empirical results show SpaceVista-7B achieving competitive to state-of-the-art performance and clear cross-scale generalization, validating the efficacy of scale-aware specialization and anchors in learning spatial reasoning across diverse environments.
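
As a rough illustration of what "LoRA-like scale experts and a scale-router" could look like, the sketch below adds one low-rank update per scale expert to a frozen linear layer and mixes the updates with softmax gates computed from the input. This is a minimal sketch under stated assumptions, not the SpaceVista-7B implementation: the class name `ScaleRoutedLinear`, the soft (dense) routing rule, the choice of five experts, and all dimensions are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ScaleRoutedLinear:
    """Frozen base weight plus one low-rank (LoRA-like) update per scale expert.

    Hypothetical illustration: names, shapes, and the routing rule are assumptions.
    """

    def __init__(self, d_in, d_out, rank, n_experts, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.base = rand(d_out, d_in)                         # frozen weight W
        self.down = [rand(rank, d_in) for _ in range(n_experts)]  # A_k
        # B_k starts at zero, as in standard LoRA, so the layer initially equals W.
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]
        self.router = rand(n_experts, d_in)                   # one score row per expert

    def gate(self, x):
        """Soft routing weights over the scale experts for input x."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        """Compute W x + sum_k g_k * B_k A_k x."""
        y = matvec(self.base, x)
        for g, a, b in zip(self.gate(x), self.down, self.up):
            delta = matvec(b, matvec(a, x))
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y


layer = ScaleRoutedLinear(d_in=4, d_out=3, rank=2, n_experts=5)
x = [1.0, 0.5, -0.2, 0.3]
print(layer.gate(x))     # five gate weights that sum to 1
print(layer.forward(x))  # equals the base output until the B_k are trained
```

The appeal of this kind of design is that each expert only stores two small matrices, so scale-specific knowledge can be separated without duplicating the full layer.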

Abstract

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which is, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.

Paper Structure

This paper contains 59 sections, 5 equations, 31 figures, 46 tables.

Figures (31)

  • Figure 1: Prior work on spatial reasoning has largely focused on indoor (1-30 m) scenes, while our SpaceVista model and dataset span scales from mm ($10^{-3}$ m) to km ($10^{3}$ m). This six-order-of-magnitude range introduces not only scale variation but also rich semantics and diverse tasks. SpaceVista enables all-scale spatial reasoning by integrating cues from micro-objects to macro-scenes.
  • Figure 2: (a) and (b) show model performance and dataset distribution across scales. Current models and datasets necessitate all-scale spatial reasoning.
  • Figure 3: (a) shows our automated data construction pipeline. The pie charts (b-c) depict the composition of scenes and sources. The bar charts (d-e) show object sizes ranging from mm to 100 m, while object-to-camera distances typically span 10-600 m. Accordingly, we claim that SpaceVista-1M basically covers the mm-km scale. The word clouds (f-g) provide a glimpse of the scene diversity.
  • Figure 4: The left part (a-d) shows that the undifferentiated mixture of cross-scale knowledge hinders, rather than facilitates, the model’s reasoning process. The horizontal axis represents the scale discrepancy, defined as $\frac{\text{answer}}{\text{gt}}$ ($= 1$ in the ideal case), and the vertical axis denotes the proportion of answers. (e) shows our SpaceVista model, where “<think>” is omitted for clarity.
  • Figure 5: Visualization of scale-expert activations on salient tokens with an appropriate threshold. This shows the router selects experts based on the input.
  • ...and 26 more figures
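
The scale discrepancy from the Figure 4 caption, the ratio of the predicted answer to the ground truth, can be computed as in the sketch below. The ratio itself follows the caption; grouping ratios into order-of-magnitude buckets with `log10`, the function names, and the sample values are assumptions made purely for illustration and are not data from the paper.

```python
import math
from collections import Counter


def scale_discrepancy(answer_m, gt_m):
    """Ratio answer / gt with both lengths in metres; 1.0 is the ideal case."""
    if answer_m <= 0 or gt_m <= 0:
        raise ValueError("lengths must be positive")
    return answer_m / gt_m


def discrepancy_proportions(pairs):
    """Proportion of answers per order-of-magnitude bucket of answer / gt.

    Bucket 0 means the answer is at roughly the right scale, -1 means about
    ten times too small, 2 means about a hundred times too large.
    The log10 bucketing is an assumption, not the paper's binning.
    """
    buckets = Counter(
        round(math.log10(scale_discrepancy(answer, gt))) for answer, gt in pairs
    )
    total = sum(buckets.values())
    return {bucket: count / total for bucket, count in sorted(buckets.items())}


# Made-up (answer, ground truth) pairs in metres, for illustration only.
pairs = [(0.012, 0.010), (1.5, 1.2), (30.0, 300.0), (2.0, 0.02)]
print(discrepancy_proportions(pairs))
```

A distribution concentrated in bucket 0 would indicate answers at the correct scale, while mass in distant buckets would indicate the cross-scale confusion that the caption describes.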