Perception-Aware Multimodal Spatial Reasoning from Monocular Images

Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang, ShiJie Li

TL;DR

This work proposes a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability, and constructs a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals.

Abstract

Spatial reasoning from monocular images is essential for autonomous driving, yet current Vision-Language Models (VLMs) still struggle with fine-grained geometric perception, particularly under large scale variation and ambiguous object appearance. We propose a simple yet effective perception-aware multimodal reasoning framework that equips VLMs with explicit object-centric grounding ability. Instead of relying on textual bounding-box outputs, each referred object is represented using all Visual Reference Tokens (VRTs) within its spatial extent, enabling visual evidence and textual reasoning to be processed jointly in a unified token space. To further strengthen cross-modal interaction, we construct a Multimodal Chain-of-Thought (MM-CoT) dataset that injects aligned visual and textual reasoning signals. A deterministic ordering strategy is introduced to make supervision over inherently unordered VRT sets fully compatible with the VLM's autoregressive next-token prediction. With only standard supervised fine-tuning, our method achieves substantial improvements on the SURDS benchmark, outperforming previous approaches, including those using RL-based post-training, by a large margin across both single-object and multi-object tasks. These results demonstrate that accurate perception and multimodal reasoning are mutually reinforcing, and together form the key to robust spatial understanding in challenging monocular driving scenarios.
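
To make the VRT representation concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a ViT-style grid of square patches, a pixel-space bounding box with exclusive right and bottom edges, and row-major (raster) order as the deterministic ordering. The abstract does not specify which ordering the paper uses, so the raster choice is an assumption. The function name `box_to_vrt_indices` and the default `patch_size` are hypothetical.

```python
from typing import List, Tuple


def box_to_vrt_indices(
    box: Tuple[int, int, int, int],
    image_size: Tuple[int, int],
    patch_size: int = 14,  # assumed patch size, not taken from the paper
) -> List[int]:
    """Return flat indices of all patches overlapping `box`, in raster order.

    `box` is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    `image_size` is (width, height) in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch_size
    rows = height // patch_size

    # A degenerate box covers no patches.
    if x1 <= x0 or y1 <= y0:
        return []

    # Convert pixel extents to inclusive patch-grid extents, clamped to the grid.
    c0 = max(0, x0 // patch_size)
    r0 = max(0, y0 // patch_size)
    c1 = min(cols - 1, (x1 - 1) // patch_size)
    r1 = min(rows - 1, (y1 - 1) // patch_size)

    # Row-major traversal gives the unordered patch set one fixed ordering,
    # which is what next-token supervision needs (assumed ordering).
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


# A 56x56 image with 14-pixel patches forms a 4x4 grid.
print(box_to_vrt_indices((10, 10, 30, 20), (56, 56)))  # [0, 1, 2, 4, 5, 6]
```

Because the same box always maps to the same index sequence, the target for an object is unambiguous under autoregressive training, whatever specific ordering rule is chosen.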

Paper Structure (13 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overview of the perception-aware multimodal reasoning framework. Visual tokens from a ViT encoder are projected into the LLM, while a Dynamic Embedding Module injects object tokens and index tokens to enable explicit object-centric grounding. Grounding markers delimit visual reference spans, allowing the LLM to jointly reason over text, visual cues, and object instances. Right: examples of detection and grounding, region understanding, and grounded image conversation supported by the framework.
  • Figure 2: Illustrative examples of the benchmark QA pairs on both single-object and multi-object