SceneScout: Towards AI Agent-driven Access to Street View Imagery for Blind Users

Gaurav Jain, Leah Findlater, Cole Gleason

TL;DR

SceneScout introduces a multimodal AI agent that makes street view imagery accessible to blind or low-vision users by enabling pre-travel Route Preview and open-ended Virtual Exploration. Grounded in Apple Maps Street View data and GPT-4o reasoning, the system generates personalized textual descriptions that users access through an accessible web interface. A mixed-methods user study (N=10) and a technical evaluation show that descriptions are largely accurate and temporally stable, though they exhibit occasional plausible errors and limited spatial precision, raising trust and safety considerations. The work discusses personalization at scale, integration of map metadata with street view imagery, and pedestrian-oriented design to inform future, more reliable, accessible navigation experiences.

Abstract

People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)-driven AI agent that enables accessible interactions with street view imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling free movement within street view imagery. Our user study (N=10) demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. A technical evaluation shows that most descriptions are accurate (72%) and describe stable visual elements (95%) even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of using street view imagery to enhance navigation experiences.

Paper Structure

This paper contains 66 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: The Route Preview interaction mode in SceneScout, enabling BLV users to familiarize themselves with a route before traveling. On the left, an illustration shows the AI agent navigating the street view imagery along the route to a nearby bus stop, while on the right is SceneScout's web interface. BLV users set a start and destination, which triggers the agent to automatically retrieve relevant street view imagery (A$_1$–B$_1$) and generate step-by-step descriptions (A$_2$–B$_2$) along the route. Finally, it provides a detailed visual description of the destination (C$_2$) based on the destination's street view images (C$_1$), assisting users with last-few-meters wayfinding. The appendix includes the web interface's text in an accessible format.
  • Figure 2: The Virtual Exploration interaction mode in SceneScout, enabling BLV users to freely explore street view. On the left, an illustration shows the AI agent's (C) movement within street view, while on the right is SceneScout's web interface. BLV users first specify their intent (A) and relevant keywords (B), which guide the descriptions generated from street view imagery. At intersections, users receive descriptions (D$_2$–F$_2$) of each possible direction (D$_1$–F$_1$) and select where to explore next. The agent then moves accordingly, generating step-by-step descriptions tailored to the user's intent, creating an interactive and personalized exploration experience. The appendix includes the web interface's text in an accessible format.
  • Figure 3: System architecture of SceneScout, an MLLM-driven AI agent for accessible street view interactions. The agent grounds itself in the real world using geographic coordinates (i.e., geocodes) and retrieves street view imagery, routes, and POI data via Apple Maps APIs [apple_maps_server_api]. BLV users' preferences, such as intent and accessibility needs, are processed alongside map data using GPT-4o, enabling SceneScout to generate textual descriptions. The web interface presents this information to BLV users.
  • Figure 4: Participants' average ratings ($N=10$) for perceived relevance and usefulness of descriptions from SceneScout's two interaction modes. While both modes received positive ratings, descriptions from Virtual Exploration were found to be slightly more relevant and useful compared to those from Route Previews. Error bars indicate standard error.
  • Figure 5: Participants’ average ratings ($N=10$) for perceived trust in descriptions and confidence in navigation across SceneScout's two interaction modes. Descriptions from Virtual Exploration were trusted slightly more than those from Route Preview. Both modes instilled a similar level of confidence in participants to navigate based on the information provided. Error bars indicate standard error.
  • ...and 6 more figures
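
The data flow described in the Figure 3 caption (geocodes, then street view imagery, then an MLLM prompted with user preferences, then textual descriptions) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`UserPreferences`, `StepDescription`, `preview_route`, `fetch_imagery`, `describe`) is hypothetical, the prompt wording is invented for illustration, and the imagery and model calls are passed in as plain functions because the actual Apple Maps and GPT-4o requests are not shown in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Geocode = Tuple[float, float]  # (latitude, longitude)


@dataclass
class UserPreferences:
    """Hypothetical container for the user's intent and accessibility needs."""
    intent: str
    accessibility_needs: List[str]


@dataclass
class StepDescription:
    """One generated description, tied to the location it describes."""
    location: Geocode
    text: str


def preview_route(
    waypoints: List[Geocode],
    prefs: UserPreferences,
    fetch_imagery: Callable[[Geocode], bytes],
    describe: Callable[[bytes, str], str],
) -> List[StepDescription]:
    """Describe each waypoint along a route, in order.

    fetch_imagery stands in for a street view lookup and describe for a
    multimodal model call; both are assumptions supplied by the caller.
    """
    needs = ", ".join(prefs.accessibility_needs) or "none stated"
    prompt = (
        f"User intent: {prefs.intent}. "
        f"Accessibility needs: {needs}. "
        "Describe what a pedestrian would encounter here."
    )
    steps = []
    for point in waypoints:
        image = fetch_imagery(point)  # imagery for this geocode
        steps.append(StepDescription(point, describe(image, prompt)))
    return steps
```

Injecting the two callables keeps the sketch runnable without network access and makes it easy to substitute stubs when testing how preferences shape the prompt.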