Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People
Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima
TL;DR
Memory-Maze addresses a critical gap in vision-language navigation by evaluating how robots interpret memory-derived route instructions from sighted passersby to guide blind users in maze-like public spaces. The authors build a CARLA-based benchmark with two instruction datasets, one collected from memory and one collected by thinking aloud, and propose a baseline VLN model that uses a single-inference LLM to generate navigation code via a modular API. Experimental results show that memory-based instructions are longer, more varied, and harder for existing models, and that the proposed single-inference code generation approach can outperform the state-of-the-art NavGPT and NaVid models in this setting. The work highlights the importance of studying memory-driven language for assistive navigation and suggests directions for adaptive maps, interactive guidance, and data augmentation to broaden real-world applicability.
Abstract
Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for navigation guidance for blind people, because this scenario requires understanding routes described from human memory, which frequently contain stutters, errors, and omissions of details, as opposed to routes described by thinking out loud, such as those in the R2R dataset. Moreover, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory was longer and contained more varied wording. We further demonstrate that addressing errors and ambiguities in memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and controls.
