Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People

Zain Merchant; Abrar Anwar; Emily Wang; Souti Chattopadhyay; Jesse Thomason

TL;DR

This work addresses contextually-relevant navigation assistance for blind and low-vision (BLV) users by framing it as a grounded instruction generation task that uses a single image observation and a goal as context. The authors build a VizWiz-based dataset of 48 image-goal pairs across indoor and outdoor settings and compare four sources of instructions: human-written, template-based, LLM-based Socratic prompting, and VLM-based generation. In studies with sighted and BLV users, LLM- and VLM-generated instructions are rated comparable to human-authored instructions in correctness and usefulness, with user preferences varying by environment and task difficulty. The work highlights benefits and risks of deploying such generative systems, particularly hallucinations, bias, and the need for context-aware, user-tailored prompts, and points to ethical considerations and design directions for real-world assistive navigation.

Abstract

Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals. In this work, we construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors. We then investigate how grounded instruction generation methods can provide contextually-relevant navigational guidance to users in these instances. Through a sighted user study, we demonstrate that large pretrained language models can produce correct and useful instructions perceived as beneficial for BLV users. We also conduct a survey and interview with 4 BLV users and observe useful insights on preferences for different instructions based on the scenario.

Paper Structure (14 sections, 3 figures, 3 tables)

Introduction and Background
Problem Setting
Grounded Instruction Generation
Study Design
Sighted User Study
BLV User Study
Results
Sighted Survey Results
BLV Survey Results
BLV Qualitative Interview
Conclusion, Ethics, and Limitations
LLM Prompt
VLM Prompt
Semi-structured Interview Questions

Figures (3)

Figure 1: We formulate the problem of providing contextually-relevant navigational instructions to blind and low vision (BLV) people as a grounded instruction generation task, which we then evaluate with sighted and BLV participants in a user study.
Figure 2: Left: We select 48 images from indoor and outdoor environments in VizWiz [8578478] and annotate them with navigation goals. Middle: We design three instruction generation methods, described further in the methods section. Right: These generated instructions are then evaluated in a user study with sighted and BLV participants.
Figure 3: Sighted participant Usefulness ratings over the generated instructions for 48 image-goal pairs across four methods separated by environment. VLM-based instructions had similar ratings across environments to humans. The LLM-based model was rated slightly less useful.
