EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality

Junlong Chen, Rosella P. Galindo Esparza, Vanja Garaj, Per Ola Kristensson, John Dudley

TL;DR

EnVisionVR addresses VR accessibility for Blind and Low Vision (BLV) users by combining a Vision Language Model (VLM) with voice input and audio and haptic feedback to interpret scenes and localize objects. The authors conduct a formative study to identify barriers, then implement a retrofittable framework with three speech-driven functions: Scene Description, Main Objects Indication, and Object Localization. In a 12-participant evaluation, EnVisionVR improves object localization and object interaction compared with a baseline without accessibility features. Scene understanding results are mixed, but user reception is positive overall and the study yields actionable design insights. The work provides a proof of concept, design guidelines for VLM-powered accessibility in VR, and a path toward more inclusive immersive experiences.

Abstract

Effective visual accessibility in Virtual Reality (VR) is crucial for Blind and Low Vision (BLV) users. However, designing visual accessibility systems is challenging due to the complexity of 3D VR environments and the need for techniques that can be easily retrofitted into existing applications. While prior work has studied how to enhance or translate visual information, the advancement of Vision Language Models (VLMs) provides an exciting opportunity to advance the scene interpretation capability of current systems. This paper presents EnVisionVR, an accessibility tool for VR scene interpretation. Through a formative study of usability barriers, we confirmed the lack of visual accessibility features as a key barrier for BLV users of VR content and applications. In response, we designed and developed EnVisionVR, a novel visual accessibility system leveraging a VLM, voice input and multimodal feedback for scene interpretation and virtual object interaction in VR. An evaluation with 12 BLV users demonstrated that EnVisionVR significantly improved their ability to locate virtual objects, effectively supporting scene understanding and object interaction.

Paper Structure

This paper contains 36 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Examples of functions supported by EnVisionVR to enhance the accessibility of VR experiences for BLV users. Left: The user can ask "Where am I?" and the EnVisionVR system reads out a detailed description of the user's current field of view. Middle: The user can ask "What is near me?" and the system reads out the names of the three main objects near the user, each with a spatial tone to indicate that object's location. Right: The user can ask "Where is the Brew Button?" and the system uses a beeping sound and directional instructions to communicate the distance to the Brew Button. When the user reaches the Brew Button, the controller vibrates to inform the user.
  • Figure 2: Overview of the Scene Description Function. Scene description is provided in two steps. In Step 1, camera anchor positions are determined by the developer or automatically by the system. Screenshots of the field of view of these anchor points with orientations of 0, 45, ..., 315 degrees along the horizontal plane together with a textual prompt are fed into GPT-4o to generate pre-baked scene descriptions. In Step 2 during runtime, we match the current camera position and orientation with the closest-matching anchor position and orientation to read out the pre-baked descriptions via the Microsoft text-to-speech (TTS) service.
  • Figure 3: Top-down view of camera anchor positions in a VR escape room and the user field of view in eight directions for each anchor point (left). At each field of view, a screenshot is taken to generate the pre-baked scene description. Example field-of-view screenshots taken at Anchor 3 are provided in the bottom images.
  • Figure 4: Scene Understanding Task: Performance of all participants (left) and the difference between scores in the EVR and NVR condition for each participant (right). Participants with blindness and severe visual impairment who regularly use assistive technology are colored in black, while others are colored in grey. Vertical jittering is applied to visualize all points. Participant IDs are labelled beside each scatter point.
  • Figure 5: Scene Understanding Task: Distribution of the perceived difficulty (higher score indicates lower perceived difficulty) for the NVR and EVR conditions for participants who regularly use assistive technology and for those who do not. Black squares indicate the mean value.
  • ...and 6 more figures
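
The runtime step in the Figure 2 caption (matching the current camera pose to the closest anchor position and orientation, then reading out a pre-baked description) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 2D `(x, z)` anchor coordinates, the Euclidean distance metric, and the dictionary layout are all assumptions made for illustration. Only the eight 45-degree orientations come from the caption.

```python
import math

# Orientations at which screenshots are taken, per the Figure 2 caption.
YAW_STEPS = [i * 45 for i in range(8)]  # 0, 45, ..., 315 degrees


def nearest_yaw(yaw_deg):
    """Return the 45-degree step closest to yaw_deg, with wrap-around at 360."""
    def angular_gap(step):
        diff = (yaw_deg - step) % 360
        return min(diff, 360 - diff)
    return min(YAW_STEPS, key=angular_gap)


def lookup_description(position, yaw_deg, anchors, descriptions):
    """Pick the pre-baked description for the closest anchor and orientation.

    anchors:      hypothetical dict mapping anchor id -> (x, z) position
    descriptions: hypothetical dict mapping (anchor id, yaw step) -> text
    """
    # Assumption: "closest" means smallest Euclidean distance on the floor plane.
    anchor_id = min(anchors, key=lambda a: math.dist(position, anchors[a]))
    return descriptions[(anchor_id, nearest_yaw(yaw_deg))]


# Toy data standing in for the pre-baked VLM output.
anchors = {"A1": (0.0, 0.0), "A3": (4.0, 2.0)}
descriptions = {(a, y): f"{a} facing {y} degrees" for a in anchors for y in YAW_STEPS}

print(lookup_description((3.5, 1.0), 350, anchors, descriptions))
```

In this sketch the returned string would be handed to a text-to-speech service. Because the descriptions are generated ahead of time, the lookup itself involves no VLM call at runtime.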