Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response

Rodrigo Gutierrez Maquilon; Marita Hueber; Georg Regal; Manfred Tscheligi

TL;DR

This study addresses the challenge of spatial reasoning in emergency first response (EFR) by integrating ground-truth depth from robot-mounted sensors with a vision-language model (VLM) to produce metrically grounded distance descriptions of objects in mixed reality. The authors implement a hybrid pipeline that combines YOLO-based detection and depth measurements within a VLM loop, and evaluate it in a mixed-reality toxic-smoke scenario across three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Results show that depth augmentation improves distance-estimation accuracy and reduces variance without increasing workload, while also raising situational awareness and perceived usefulness. These findings indicate that metrically grounded, object-centric verbal information can enhance spatial reasoning and decision-making under time pressure in XR-robotic EFR, offering a concrete path toward more trustworthy human–AI collaboration in hazardous environments.

Abstract

Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies such as spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically grounded distances of detected objects (e.g., "the chair is 3.02 meters away"). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth augmentation improved objective accuracy and stability (for example, the estimation error for both the victim and the window distances dropped) while raising situational awareness without increasing workload. Conversely, depth-agnostic assistance increased workload and slightly worsened accuracy. We contribute to human SA augmentation by demonstrating that metrically grounded, object-centric verbal information supports spatial reasoning in EFR and improves decision-relevant judgments under time pressure.
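The fusion of detections with depth readings described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the function names, the median-over-box-center heuristic, the convention that a depth of 0 marks a missing reading, and the wording of the context string are all assumptions made for illustration. The sketch takes a per-pixel depth map in meters and a list of labeled bounding boxes (as a detector such as YOLO would provide), and turns them into distance-annotated labels that could be passed to a VLM as context.

```python
from statistics import median


def box_distance_m(depth_m, box, keep=0.5):
    """Median depth in meters over the central region of a bounding box.

    depth_m: 2D list of per-pixel depths in meters (assumed: 0 = no reading).
    box: (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    keep: fraction of the box width/height kept around its center, so that
          background pixels near the box edges are less likely to be included.
    """
    x1, y1, x2, y2 = box
    margin_x = int((x2 - x1) * (1 - keep) / 2)
    margin_y = int((y2 - y1) * (1 - keep) / 2)
    rows = depth_m[y1 + margin_y : y2 - margin_y]
    values = [
        d
        for row in rows
        for d in row[x1 + margin_x : x2 - margin_x]
        if d > 0  # skip missing readings
    ]
    return median(values) if values else None


def depth_label(name, distance_m):
    """Format one detection as a distance-annotated label."""
    if distance_m is None:
        return f"{name} (distance unavailable)"
    return f"{name} {distance_m:.2f} m"


def build_context(detections, depth_m):
    """Join all annotated labels into one context line (hypothetical wording)."""
    labels = [
        depth_label(name, box_distance_m(depth_m, box))
        for name, box in detections
    ]
    return "Detected objects with measured distances: " + "; ".join(labels)


# Toy 4x6 depth map: left four columns read 3.02 m, right two have no reading.
depth_map = [[3.02] * 4 + [0.0] * 2 for _ in range(4)]
detections = [("chair", (0, 0, 4, 4)), ("window", (4, 0, 6, 4))]
context = build_context(detections, depth_map)
print(context)
```

Using the median rather than the mean is one way to keep a few stray background or missing pixels from skewing the reported distance; a real system would also need to handle sensor range limits and the alignment between the color and depth frames.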

Paper Structure (31 sections, 3 figures, 4 tables)

Introduction
Related Work
Existing uses of LLMs in EFR training and operations
Human spatial context awareness and its limitations
Current Multimodal Spatial LLMs for 3D reasoning
Method
Mixed Reality Simulation
Apparatus and VLM Interaction
Study Design
Participants
Measurements
Situational Awareness
Perceived Workload
Voice Interaction
Perceived Usability
...and 16 more sections

Figures (3)

Figure 1: Integration diagram of the Jethexa robot, Unity application, vision language model, and mixed-reality head-mounted display.
Figure 2: Condition 3. Vision language model qwen2.5vl:32b capture of depth camera feed with distance measurements integrated in the labels of YOLOv8x detected objects.
Figure 3: C1.2 and C1.3 = video-only baseline; C2 = VLM support; C3 = depth-augmented VLM support.
