Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People

Zain Merchant, Abrar Anwar, Emily Wang, Souti Chattopadhyay, Jesse Thomason

TL;DR

This paper frames contextually-relevant navigation assistance for blind and low-vision (BLV) users as a grounded instruction generation task that takes a single image observation and a navigation goal as input. The authors build a dataset of 48 image-goal pairs drawn from VizWiz, covering indoor and outdoor settings, and compare four sources of instructions: human-written, template-based, LLM-based Socratic prompting, and VLM-based generation. In user studies with sighted and BLV participants, LLM- and VLM-generated instructions are rated comparably to human-authored instructions on correctness and usefulness, with preferences varying by environment and task difficulty. The work also discusses the benefits and risks of deploying such generative systems, particularly hallucinations, bias, and the need for context-aware, user-tailored prompts, and points to ethical considerations and design directions for real-world assistive navigation.

Abstract

Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals. In this work, we construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors. We then investigate how grounded instruction generation methods can provide contextually-relevant navigational guidance to users in these instances. Through a sighted user study, we demonstrate that large pretrained language models can produce correct and useful instructions perceived as beneficial for BLV users. We also conduct a survey and interview with four BLV users and gain insights into their preferences for different instructions depending on the scenario.

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: We formulate the problem of providing contextually-relevant navigational instructions to blind and low vision (BLV) people as a grounded instruction generation task, which we then evaluate with sighted and BLV participants in a user study.
  • Figure 2: Left: We select 48 images from indoor and outdoor environments in VizWiz [8578478] and annotate them with navigation goals. Middle: We design three instruction generation methods, described further in the methods section. Right: These generated instructions are then evaluated in a user study with sighted and BLV participants.
  • Figure 3: Usefulness ratings from sighted participants for the instructions generated for the 48 image-goal pairs, across four methods and separated by environment. VLM-based instructions received ratings similar to human-written instructions across environments. LLM-based instructions were rated slightly less useful.