Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

Tanner Muturi, Blessing Agyei Kyem, Joshua Kofi Asamoah, Neema Jakisa Owor, Richard Dyzinela, Andrews Danyo, Yaw Adu-Gyamfi, Armstrong Aboah

TL;DR

The paper tackles robust spatial reasoning in cluttered warehouse scenes by augmenting RGB-D transformers with prompt-based geometric grounding. It introduces a SpatialBot-based architecture that embeds object bounding box coordinates into prompts and uses an answer normalization module to align outputs with evaluation protocols, trained on the Physical AI Spatial Intelligence Warehouse dataset across four spatial tasks. The method demonstrates that explicit spatial grounding and depth cues substantially improve performance, achieving an S1 score of 73.0606 and ranking 4th on Track 3 of the AI City Challenge. This work offers a practical path toward depth-aware, geometry-grounded vision-language systems for industrial applications, reducing reliance on 2D appearance cues and enabling reliable multi-object reasoning.

Abstract

Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often generalize poorly in such settings because they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset, introduced in Track 3 of the 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions, in the form of bounding box coordinates, directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework with task-specific supervision across four question categories: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our complete pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
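
The prompt enrichment described in the abstract can be illustrated with a minimal sketch. It assumes each question marks its objects with a `<mask>` placeholder and that every object comes with a 2D binary mask; the function names, the placeholder string, and the `Region i [x1, y1, x2, y2]` output format are hypothetical choices made for illustration, not the authors' exact implementation.

```python
import re


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels in a 2D binary mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    return (min(cols), min(rows), max(cols), max(rows))


def enrich_prompt(question, masks, placeholder="<mask>"):
    """Replace each placeholder, in order, with a region id and its bounding box.

    The placeholder token and the output format are assumptions for this sketch.
    """
    boxes = [mask_to_bbox(m) for m in masks]
    if question.count(placeholder) != len(boxes):
        raise ValueError("number of placeholders and masks must match")
    indexed_boxes = iter(enumerate(boxes))

    def substitute(_match):
        idx, (x1, y1, x2, y2) = next(indexed_boxes)
        return f"Region {idx} [{x1}, {y1}, {x2}, {y2}]"

    return re.sub(re.escape(placeholder), substitute, question)


# Toy 3x4 masks standing in for real segmentation masks.
mask_a = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 0, 0]]
mask_b = [[0, 0, 0, 1],
          [0, 0, 0, 1],
          [0, 0, 0, 0]]
print(enrich_prompt("Is the pallet <mask> to the left of the pallet <mask>?",
                    [mask_a, mask_b]))
```

Writing the coordinates as plain text keeps the model's input interface unchanged: the geometry reaches the language model through the prompt itself rather than through an additional input channel.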

Paper Structure

This paper contains 21 sections, 4 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Example of spatial prompt transformation. The original prompt (top) uses natural language placeholders. The modified prompt encodes explicit bounding box coordinates, and the GPT-style answer is reformatted with a normalized response for consistent evaluation.
  • Figure 2: Overview of our spatial reasoning architecture. The system processes RGB and depth images through a shared image encoder (SigLIP), while textual prompts are normalized and encoded separately. A vision-language transformer fuses the modalities to generate free-form responses. An answer normalization module extracts concise outputs. Spatial grounding is enabled by injecting bounding box coordinates and region identifiers into the prompts.
  • Figure 3: Qualitative example illustrating the model's ability to count pallets within the buffer region closest to the rightmost shelf. The model correctly identifies Region 14 as the shelf, Region 1 as the closest buffer zone, and detects three relevant pallet regions within the specified area.
  • Figure 4: Qualitative example demonstrating the model’s capability in pairwise spatial comparison. The model accurately infers that Region 0 lies to the left of Region 1 from the given viewpoint and bounding box inputs.
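
The answer normalization step mentioned in the captions for Figures 1 and 2 can be sketched as a small rule-based extractor. This is only an illustration under stated assumptions: the category names, the regular expressions, the number-word table, and the `Normalized answer:` suffix are all hypothetical, and the source does not specify how the authors' module is implemented.

```python
import re

# Hypothetical lookup for spelled-out counts; real responses may need more entries.
_NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                 "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
_WORD_PATTERN = re.compile(r"\b(" + "|".join(_NUMBER_WORDS) + r")\b")


def normalize_answer(response, category):
    """Extract a concise answer string from a free-form response, or None."""
    text = response.lower()
    if category == "relation":
        match = re.search(r"\b(left|right)\b", text)
        return match.group(1) if match else None
    if category == "grounding":
        match = re.search(r"\bregion\s+(\d+)", text)
        return match.group(1) if match else None
    # Numeric categories: drop region identifiers so their ids are not
    # mistaken for the distance or count being reported.
    stripped = re.sub(r"\bregion\s+\d+", " ", text)
    match = re.search(r"\d+(?:\.\d+)?", stripped)
    if match:
        return match.group(0)
    if category == "count":
        word = _WORD_PATTERN.search(stripped)
        if word:
            return str(_NUMBER_WORDS[word.group(1)])
    return None


def append_normalized(response, category):
    """Append the normalized answer to a training response (suffix format is assumed)."""
    answer = normalize_answer(response, category)
    if answer is None:
        return response
    return f"{response.strip()} Normalized answer: {answer}"


print(append_normalized(
    "There are three pallets in the buffer region closest to Region 14.", "count"))
```

Stripping region identifiers before searching for a number matters for responses like the one in Figure 3, where "Region 14" would otherwise be read as the count.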