Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo

Abstract

This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

Paper Structure (32 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Evaluation examples for the fundamental counting task: Images feature one to six chairs, with varying arrangements for the scenarios involving three and four chairs.
  • Figure 2: Evaluation examples for the fundamental relative spatial reasoning task: Images feature chairs at varying distances from the viewpoint, with both different and identical chair styles. Cases 1 and 2 are also flipped to investigate potential biases in model predictions.
  • Figure 3: Spatial reasoning keyword analysis: The x-axis represents the correct answer, while the bars indicate the ratio of responses extracted from model outputs. Dark green bars denote optimal answers for pBLV applications, using spatial terms (e.g., "left", "right") to describe positional relationships. Light green bars represent suboptimal answers that rely on color-based descriptions (e.g., "orange", "yellow") but demonstrate correct spatial reasoning.
  • Figure 4: Evaluation examples for the fundamental commonsense reasoning task: Images show chairs that are either occupied or available, with various objects such as coats hanging on the chair or laptops placed on surfaces to indicate occupancy.
  • Figure 5: Evaluation examples for navigation tasks: Scenarios include obstacles placed along the path to the chair or objects placed on the chair and corresponding tabletop. Obstacle and object types, such as backpacks, coats, and boxes, simulate complex real-life scenarios to assess the model’s ability to identify empty chairs, plan reasonable navigation paths, and avoid obstacles effectively.
  • ...and 2 more figures
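
The keyword analysis described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that each model response is labeled by the kind of term it uses, and that the bar heights are the share of responses with each label. The keyword sets, the function names, and the rule that spatial terms take precedence are all hypothetical; only the example terms ("left", "right", "orange", "yellow") come from the caption. The sketch labels the vocabulary a response uses and does not check whether the answer is correct.

```python
import re
from collections import Counter

# Hypothetical keyword sets, seeded with the example terms from the Figure 3 caption.
SPATIAL_TERMS = {"left", "right"}
COLOR_TERMS = {"orange", "yellow"}


def classify_response(text):
    """Label a response by the kind of descriptive term it contains.

    Assumption: a response with any spatial term counts as 'spatial',
    even if it also mentions a color.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & SPATIAL_TERMS:
        return "spatial"
    if words & COLOR_TERMS:
        return "color"
    return "other"


def keyword_ratios(responses):
    """Return the fraction of responses carrying each label."""
    if not responses:
        return {}
    counts = Counter(classify_response(r) for r in responses)
    total = len(responses)
    return {label: counts[label] / total for label in ("spatial", "color", "other")}


# Made-up responses for illustration only; these are not data from the paper.
example_responses = [
    "The chair on the left is closer.",
    "The orange chair is closer.",
    "The left chair.",
    "I cannot tell.",
]
ratios = keyword_ratios(example_responses)
```

In this sketch, a higher "spatial" share corresponds to the dark green bars (descriptions a pBLV user can act on without seeing color), and the "color" share corresponds to the light green bars.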