MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation

Haoming Wang, Qiyao Xue, Weichen Liu, Wei Gao

TL;DR

This work tackles cross-frame visual spatial reasoning for embodied AI on resource-constrained devices, where small on-device VLMs lack 3D spatial awareness. It introduces MosaicThinker, which builds a sparse global semantic map in an $SE(3)$ coordinate system: per-frame spatial cues are iteratively aligned via pairwise transforms $T_{i\to j}$, and these transforms are chained along the paths of a maximum spanning tree to compute each frame's global pose $M_i \in SE(3)$. The resulting map then guides the VLM through a BEV-like visual prompt. The method comprises a preprocessing step that grounds task objects, iterative semantic-map construction, topology-aware multi-frame alignment, Gaussian-kernel-based key-frame selection, and an occlusion refinement stage. Experiments on NVIDIA Jetson Orin and mobile devices, using VSI-Bench, STI-Bench, and Metro-Spatial-QA, demonstrate significant cross-frame spatial-reasoning improvements over training-free baselines, with feasible compute costs and robust performance under varying scene complexity.
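The pose-assembly step in the summary above can be illustrated with a minimal sketch. It assumes each pairwise transform $T_{i\to j}$ is a 4x4 homogeneous matrix mapping frame-$i$ coordinates into frame $j$, and that every edge carries a scalar alignment-confidence weight; the summary does not say how either is obtained, so both are assumptions. All function names (`global_poses`, `se3_inverse`, `translation`) are hypothetical and not taken from the paper. The sketch grows a maximum spanning tree from a reference frame with Prim's algorithm and composes transforms along tree edges to obtain each $M_i$.

```python
import heapq
import itertools

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def se3_inverse(t):
    """Invert [R | p] as [R^T | -R^T p], avoiding a general matrix inverse."""
    rot_t = [[t[c][r] for c in range(3)] for r in range(3)]
    trans = [-sum(rot_t[r][k] * t[k][3] for k in range(3)) for r in range(3)]
    return [rot_t[r] + [trans[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def translation(x, y, z):
    """Pure-translation transform, used only to build the toy example."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def global_poses(num_frames, edges, root=0):
    """Return {i: M_i}, where M_i maps frame-i coordinates into the root frame.

    edges: {(i, j): (weight, T_i_to_j)}; weight is an assumed confidence score.
    Frames not connected to the root are left out of the result.
    """
    adjacency = {i: [] for i in range(num_frames)}
    for (i, j), (weight, t_i_to_j) in edges.items():
        # Seen from j, frame i is a child reachable via T_{i->j}.
        adjacency[j].append((weight, i, t_i_to_j))
        # Seen from i, frame j is a child reachable via T_{j->i} = T_{i->j}^-1.
        adjacency[i].append((weight, j, se3_inverse(t_i_to_j)))

    poses = {root: IDENTITY}
    tie_breaker = itertools.count()  # keeps heap entries comparable
    heap = []

    def push_neighbours(parent):
        for weight, child, t_child_to_parent in adjacency[parent]:
            if child not in poses:
                # Negate the weight: heapq is a min-heap, we want the maximum.
                heapq.heappush(
                    heap,
                    (-weight, next(tie_breaker), parent, child, t_child_to_parent),
                )

    push_neighbours(root)
    while heap:
        _, _, parent, child, t_child_to_parent = heapq.heappop(heap)
        if child in poses:
            continue
        # Chain along the tree edge: M_child = M_parent @ T_{child->parent}.
        poses[child] = matmul(poses[parent], t_child_to_parent)
        push_neighbours(child)
    return poses


# Toy example: the low-confidence direct edge (2, 0) is skipped by the tree.
edges = {
    (1, 0): (0.9, translation(1.0, 0.0, 0.0)),
    (2, 1): (0.8, translation(0.0, 2.0, 0.0)),
    (2, 0): (0.1, translation(5.0, 5.0, 0.0)),
}
poses = global_poses(3, edges)
print([row[3] for row in poses[2][:3]])  # [1.0, 2.0, 0.0]
```

In this toy case frame 2 reaches the reference frame through frame 1, so its global translation is the sum of the two high-confidence hops rather than the noisy direct estimate. Preferring high-weight edges is the point of using a maximum rather than a minimum spanning tree.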

Abstract

As embodied AI expands from traditional object detection and recognition to more advanced tasks such as robot manipulation and actuation planning, visual spatial reasoning over video inputs becomes necessary to perceive the spatial relationships among objects and to guide device actions. However, existing visual language models (VLMs) are weak at spatial reasoning because they lack knowledge of 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances a small on-device VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation, a global semantic map, and then guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, across reasoning tasks of diverse types and complexities.

Paper Structure

This paper contains 29 sections, 8 equations, 21 figures, and 10 tables.

Figures (21)

  • Figure 1: Different tasks in embodied AI applications
  • Figure 2: Construction of semantic map as the unified space representation
  • Figure 3: The existing VLM's performance discrepancy between appearance-based and spatial reasoning tasks. Outputs are generated by Qwen-2.5-VL-32B.
  • Figure 4: The VLM's spatial reasoning capability declines as the model size decreases
  • Figure 5: Incorrect cross-frame spatial reasoning where task-related objects are occluded or separately appear in frames. Outputs are generated by Gemini-2.5-Pro.
  • ...and 16 more figures