Memory-Maze: Scenario Driven Visual Language Navigation Benchmark for Guiding Blind People

Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima

TL;DR

Memory-Maze addresses a gap in vision-language navigation (VLN): evaluating how robots interpret route instructions that sighted passersby recall from memory, so that the robots can guide blind users through maze-like public spaces. The authors build a CARLA-based benchmark with two instruction datasets, one collected from memory and one collected by thinking aloud, and propose a baseline VLN model that uses a single LLM inference to generate navigation code against a modular API. Experimental results show that memory-based instructions are longer, more varied, and harder for existing models, and that the proposed single-inference code-generation approach can outperform the state-of-the-art NavGPT and NaVid models in this setting. The work highlights the importance of studying memory-driven language for assistive navigation and suggests directions for adaptive maps, interactive guidance, and data augmentation to broaden real-world applicability.

Abstract

Robots powered by Visual Language Navigation (VLN) have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omissions of details, as opposed to those obtained by thinking out loud, such as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory was longer and contained more varied wording. We further demonstrate that addressing errors and ambiguities from memory-based instructions is challenging, by evaluating state-of-the-art models alongside our baseline model with modularized perception and controls.

Paper Structure (22 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Memory-Maze Benchmark. Top: the instructions obtained in the memory-based scenario contain unique phrases, highlighted in green, in contrast to those collected in traditional think-out-loud settings. Middle: our benchmark environment, based on the CARLA simulator (Dosovitskiy et al., 2017). Bottom: the VLN agent that navigates within the environment.
  • Figure 2: Bird's-Eye Views of Memory-Maze. The benchmark contains three environments. The university includes features such as classrooms, offices, hallways, a kitchen, and a library. The 5th floor of the museum mainly contains exhibits. The 7th floor contains conference rooms, hallways, and a terrace area. Each environment includes two routes, for a total of six routes. In the on-site study, participants were asked to describe the route from the starting point to the end point; their descriptions may therefore differ from the visualized route.
  • Figure 3: Word Clouds. The on-site instruction data contains unique phrases that come from talking while recalling from memory, such as "uh," "maybe," and "okay."
  • Figure 4: Method Overview. Given a set of instructions from a sighted passerby, the LLM first parses it into an itemized format. Then, combined with the API specification, the LLM generates Python code directly to control the robot, which runs in the virtual environment using the simulated sensor inputs.
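
The two-stage flow described in the Figure 4 caption (itemize the instruction, then generate control code from the itemized steps plus an API specification) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the paper: the API functions `move_forward` and `turn`, the prompt wording, the `LoggingRobot` stand-in for the simulator, and the `fake_llm` stub that replaces a real LLM call with canned replies.

```python
from typing import Callable, List, Tuple

# Hypothetical API specification text. The paper's actual modular API
# is not given in this section, so these two functions are placeholders.
API_SPEC = (
    "move_forward(meters: float) -> None\n"
    "turn(direction: str) -> None  # 'left' or 'right'\n"
)


def itemize(instruction: str, llm: Callable[[str], str]) -> List[str]:
    """Stage 1: ask the LLM to rewrite a free-form instruction as steps."""
    prompt = "Rewrite the route instruction as a numbered list:\n" + instruction
    reply = llm(prompt)
    # Keep every non-empty line of the reply as one step.
    return [line.strip() for line in reply.splitlines() if line.strip()]


def generate_code(steps: List[str], api_spec: str, llm: Callable[[str], str]) -> str:
    """Stage 2: ask the LLM for Python code that uses only the given API."""
    prompt = (
        "You control a robot through this API:\n" + api_spec
        + "\nWrite Python code that follows these steps:\n" + "\n".join(steps)
    )
    return llm(prompt)


class LoggingRobot:
    """Stand-in for the simulated robot: records calls instead of moving."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, object]] = []

    def move_forward(self, meters: float) -> None:
        self.log.append(("move_forward", meters))

    def turn(self, direction: str) -> None:
        self.log.append(("turn", direction))


def run_generated_code(code: str, robot: LoggingRobot) -> None:
    """Run generated code with only the robot API in scope.

    exec() on model output is unsafe outside a sandbox; it is used here
    only to keep the illustration short.
    """
    namespace = {"move_forward": robot.move_forward, "turn": robot.turn}
    exec(code, namespace)


def fake_llm(prompt: str) -> str:
    """Canned replies standing in for a real LLM (assumption)."""
    if prompt.startswith("Rewrite"):
        return "1. Go straight down the hallway\n2. Turn left at the kitchen"
    return "move_forward(10.0)\nturn('left')"


if __name__ == "__main__":
    steps = itemize("uh, go straight, then maybe left at the kitchen", fake_llm)
    robot = LoggingRobot()
    run_generated_code(generate_code(steps, API_SPEC, fake_llm), robot)
    print(steps)
    print(robot.log)
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model or client library; a real system would also need the simulated sensor inputs mentioned in the caption, which this sketch omits.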