mmWalk: Towards Multi-modal Multi-view Walking Assistance

Kedi Ying; Ruiping Liu; Chongyan Chen; Mingzhe Tao; Hao Shi; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

TL;DR

mmWalk introduces a simulated, multi-view, multi-modal walking dataset and an accompanying VQA benchmark (mmWalkVQA) tailored to outdoor navigation assistance for people with blindness or low vision (BLV). By capturing synchronized panoramic frames from walker, drone, and guide-dog perspectives across RGB, depth, and semantic modalities, and by annotating corner cases and navigational landmarks, the work provides a rich platform for evaluating multi-modal perception and reasoning in safety-critical scenarios. Benchmarking reveals that state-of-the-art Vision-Language Models (VLMs) struggle with risk assessment and navigational tasks, but fine-tuning on mmWalk improves performance and transfers to real-world VQA (EgoTextVQA), underscoring the dataset’s practical value. Overall, mmWalk offers a resource for developing safer, more context-aware assistive technologies and for evaluating multi-view VLMs in embodied perception settings.

Abstract

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor data and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
