WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation

Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu

TL;DR

WalkGPT is a pixel-grounded LVLM for the new task of Grounded Navigation Guide. It unifies language reasoning and segmentation within a single architecture for depth-aware accessibility guidance.

Abstract

Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website: https://sites.google.com/view/walkgpt-26/home

Paper Structure (21 sections, 14 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Overview of WalkGPT for accessibility-aware grounded navigation guide. The model grounds language on segmentation masks enriched with depth information, providing holistic spatial understanding that captures both object shapes and depth cues for interpretable accessibility analysis.
  • Figure 2: Overview of WalkGPT for grounded navigation guidance. (a) Overall framework. (b) The Multi-Scale Query Projector (MSQP), which aggregates multi-level visual features into spatially aligned image tokens for language reasoning. (c) The Calibrated Text Projector (CTP), guided by the proposed Region Alignment Loss, maps <SEG> tokens into the visual space. Structured tokens (<SEG>, <distance>, <assessment>, <p>) link language generation with segmentation and depth reasoning.
  • Figure 3: Pipeline for generating accessibility-aware VQA pairs in the PAVE dataset. The LLM receives the system prompt, detected features, their distance values, and the accessibility of the features, and generates structured outputs containing <assessment>, <distance>, <SEG>, and <p> tokens.
  • Figure 4: Qualitative results of WalkGPT on the PAVE validation set. Given a scene image, WalkGPT generates grounded conversations together with segmentation masks and depth-aware distance estimates, reflecting its understanding of accessibility and spatial context. Additional examples are provided in the Appendix.
  • Figure 5: Failure case study on PAVE. WalkGPT misinterprets strong road reflections on the building façade as physical obstacles, producing incorrect guidance even though the path itself is fully accessible. Part of the image is blurred for privacy.
  • ...and 4 more figures
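
The structured tokens named in the Figure 2 and Figure 3 captions (<SEG>, <distance>, <assessment>, <p>) suggest that a generated response could be post-processed into one record per grounded object. The sketch below is illustrative only. The source does not specify the token syntax, so the paired-tag layout, the sample response text, and the names `GroundedItem` and `parse_response` are all assumptions rather than WalkGPT's actual output format or API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# ASSUMPTION: a grounded phrase is wrapped in <p>...</p>, followed by a <SEG>
# marker, then optional <distance>...</distance> and
# <assessment>...</assessment> tags. The paper's real syntax may differ.
ITEM_PATTERN = re.compile(
    r"<p>(?P<phrase>.*?)</p>\s*<SEG>"
    r"(?:\s*<distance>(?P<distance>.*?)</distance>)?"
    r"(?:\s*<assessment>(?P<assessment>.*?)</assessment>)?",
    re.DOTALL,
)


@dataclass
class GroundedItem:
    """One grounded object mention (hypothetical record type)."""
    phrase: str
    distance: Optional[str]
    assessment: Optional[str]


def _clean(value: Optional[str]) -> Optional[str]:
    """Strip whitespace; map missing or empty tag bodies to None."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def parse_response(text: str) -> List[GroundedItem]:
    """Extract grounded items from a response in the assumed tag layout."""
    return [
        GroundedItem(
            phrase=m.group("phrase").strip(),
            distance=_clean(m.group("distance")),
            assessment=_clean(m.group("assessment")),
        )
        for m in ITEM_PATTERN.finditer(text)
    ]


# Hypothetical response text, written for this example only.
sample = (
    "The <p>sidewalk</p> <SEG> <distance>near</distance> "
    "<assessment>accessible</assessment> is clear, but a <p>pole</p> <SEG> "
    "<distance>far</distance> <assessment>obstacle</assessment> stands ahead."
)
items = parse_response(sample)
for item in items:
    print(item)
```

In a real pipeline each <SEG> position would also be associated with a predicted mask; the sketch covers only the text side.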