Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People
Zain Merchant, Abrar Anwar, Emily Wang, Souti Chattopadhyay, Jesse Thomason
TL;DR
This work addresses contextually-relevant navigation assistance for blind and low-vision (BLV) users with a grounded instruction generation framework that uses single-image observations and goal context. The authors create a VizWiz-based dataset of 48 image-goal pairs across indoor and outdoor settings and compare four instruction-generation approaches: human-written, template-based, LLM-based Socratic prompting, and VLM-based generation. Across sighted and BLV user studies, LLM- and VLM-generated instructions show correctness and usefulness comparable to human-authored instructions, with user preferences varying by environment and task difficulty. The work highlights the benefits and risks of deploying such generative systems, particularly hallucinations, bias, and the need for context-aware, user-tailored prompts, and points to ethical considerations and design directions for real-world assistive navigation.
Abstract
Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals. In this work, we construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors. We then investigate how grounded instruction generation methods can provide contextually-relevant navigational guidance to users in these instances. Through a sighted user study, we demonstrate that large pretrained language models can produce correct and useful instructions perceived as beneficial for BLV users. We also conduct a survey and interview with four BLV users and gain insights into their preferences for different instructions depending on the scenario.
