Perception-Aware Multimodal Spatial Reasoning from Monocular Images

Yanchun Cheng; Rundong Wang; Xulei Yang; Alok Prakash; Daniela Rus; Marcelo H Ang; ShiJie Li

TL;DR

This work proposes a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability, and constructs a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals.

Abstract

Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches (including those using RL-based post-training) by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
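As a rough illustration of the idea of representing an object by all visual tokens within its spatial extent, the sketch below maps a pixel-space box to the indices of the image patches it overlaps and emits them in a fixed order. Everything here is an assumption made for illustration: the function name `vrt_indices`, the square patch grid, and the row-major (raster) ordering are hypothetical, and the abstract does not state which deterministic ordering the authors actually use.

```python
from math import ceil


def vrt_indices(box, image_size, patch_size):
    """Return indices of the patches that a box overlaps, in raster order.

    Hypothetical helper, not the paper's implementation.
    box:        (x_min, y_min, x_max, y_max) in pixels, max edges exclusive.
    image_size: (width, height) in pixels.
    patch_size: side length of one square patch in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size

    # Size of the patch grid (assumed layout: one token per patch).
    cols = ceil(width / patch_size)
    rows = ceil(height / patch_size)

    # Clamp the box to the image; an empty intersection selects no tokens.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    if x1 <= x0 or y1 <= y0:
        return []

    # First overlapped column/row (inclusive) and last one (exclusive).
    c0, r0 = int(x0 // patch_size), int(y0 // patch_size)
    c1 = min(cols, ceil(x1 / patch_size))
    r1 = min(rows, ceil(y1 / patch_size))

    # Row-major order turns the unordered patch set into one fixed
    # sequence, which is what a next-token target requires.
    return [r * cols + c for r in range(r0, r1) for c in range(c0, c1)]
```

For a 56x56 image with 14-pixel patches (a 4x4 grid), the box (10, 10, 30, 20) covers the first three columns of the first two rows, so the function returns [0, 1, 2, 4, 5, 6]. Any fixed ordering would give the same box the same target sequence on every call; raster order is simply the easiest one to show.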

Paper Structure (13 sections, 6 equations, 2 figures, 2 tables)

Introduction
Related Works
Methodology
Preliminary
Overview
Multi-Modal Chain-of-Thought Format
Supervision
Experiments
Experimental Setting
Comparison to the state-of-the-art
Ablation Study
Future Work & Limitation
Conclusion

Figures (2)

Figure 1: Overview of the perception-aware multimodal reasoning framework. Visual tokens from a ViT encoder are projected into the LLM, while a Dynamic Embedding Module injects object tokens and index tokens to enable explicit object-centric grounding. Grounding markers delimit visual reference spans, allowing the LLM to jointly reason over text, visual cues, and object instances. Right: examples of detection and grounding, region understanding, and grounded image conversation supported by the framework.
Figure 2: Illustrative examples of the benchmark QA pairs on both single-object and multi-object tasks.
