mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

TL;DR

mmWalk introduces a simulated, multi-view, multi-modal walking dataset and an accompanying VQA benchmark (mmWalkVQA) tailored for outdoor navigation assistance for BLV users. By capturing synchronized panoramic frames from walker, drone, and guide-dog perspectives across RGB, depth, and semantic modalities, and by annotating corner cases and navigational landmarks, the work provides a rich platform for evaluating multi-modal perception and reasoning in safety-critical scenarios. Benchmarking reveals that state-of-the-art Vision-Language Models struggle with risk assessment and navigational tasks, but fine-tuning on mmWalk improves performance and transfers to real-world VQA (EgoTextVQA), underscoring the dataset’s practical value. Overall, mmWalk offers a valuable resource for developing safer, more context-aware assistive technologies and evaluating multi-view VLMs in embodied perception settings.

Abstract

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
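
As a rough sanity check on the dataset figures above, the sketch below enumerates the view and modality combinations and estimates the image count. It assumes that every synchronized frame yields exactly one panoramic image per (view, modality) pair, which the summary does not state explicitly. The identifiers `VIEWS`, `MODALITIES`, `streams_per_frame`, and `estimated_image_count` are hypothetical names chosen for illustration, not part of any mmWalk tooling.

```python
from itertools import product

# Views and modalities as named in the summary; the identifiers are illustrative.
VIEWS = ("walker", "drone", "guide_dog")
MODALITIES = ("rgb", "depth", "semantic")


def streams_per_frame(views=VIEWS, modalities=MODALITIES):
    """Return every (view, modality) pair captured at one synchronized timestep."""
    return list(product(views, modalities))


def estimated_image_count(num_frames, views=VIEWS, modalities=MODALITIES):
    """Estimate total images, assuming one image per (view, modality) per frame."""
    return num_frames * len(streams_per_frame(views, modalities))


if __name__ == "__main__":
    print(len(streams_per_frame()))        # 9 streams per synchronized frame
    print(estimated_image_count(62_000))   # 558000
```

Under this assumption, the rounded figure of 62k frames gives 558,000 images, which is in line with the reported "over 559k" if the exact frame count is slightly above 62,000.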

Paper Structure

This paper contains 20 sections, 1 equation, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Visualization of a challenging walking scenario of the established mmWalk dataset. While walking in a parking area (the trajectory is the marked blue path), corner cases include narrow path and uneven road. The navigational landmarks for BLV include pole, traffic light, and entrance.
  • Figure 2: Number of trajectories per scenario (left) and weather condition (right).
  • Figure 3: Workflow of VQA triplets generation.
  • Figure 4: Qualitative examples of output from finetuned InternVL2-8B.
  • Figure 5: Overview of the CARLA map from the spectator perspective.
  • ...and 7 more figures