A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction

Yu Hao; Fan Yang; Hao Huang; Shuaihang Yuan; Sundeep Rangan; John-Ross Rizzo; Yao Wang; Yi Fang

TL;DR

The paper presents VisPercep, a multi-modal foundation model designed to assist people with blindness and low vision (pBLV) in environmental interaction by fusing an image tagging module (RAM) with a vision-language model (InstructBLIP) through prompt engineering. It addresses critical challenges in scene understanding, object localization, and hazard risk assessment in unfamiliar settings. The approach delivers detailed environmental descriptions and hazard warnings, with validated results on indoor/outdoor datasets (Visual7W, VizWiz) and real-world tests, showing fast inference and high helpfulness in pBLV scenarios. This work advances accessible navigation by enabling natural language–guided querying and robust, context-aware guidance.

Abstract

People with blindness and low vision (pBLV) encounter substantial challenges in comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, because of vision loss, pBLV have difficulty identifying potential tripping hazards on their own. In this paper, we present an approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Our method begins by using a large image tagging model (i.e., Recognize Anything (RAM)) to identify all common objects present in the captured images. The recognition results and the user query are then integrated into a prompt tailored specifically for pBLV using prompt engineering. Given the prompt and the input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing the objects and scenes relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method recognizes objects accurately and provides insightful descriptions and analysis of the environment for pBLV.
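
The pipeline described in the abstract (tag the image, fold the tags and the user query into a prompt, then pass the prompt and image to a vision-language model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template wording, the function names `build_prompt` and `describe_scene`, and the `tagger` and `vlm` callables are all hypothetical stand-ins for RAM and InstructBLIP.

```python
from typing import Any, Callable, Sequence


def build_prompt(tags: Sequence[str], user_query: str) -> str:
    """Combine image tags and a user question into one text prompt.

    The template wording is an assumption made for illustration;
    it is not the prompt used in the paper.
    """
    # Normalize tags and drop duplicates while keeping first-seen order.
    cleaned = (t.strip().lower() for t in tags if t.strip())
    unique_tags = list(dict.fromkeys(cleaned))
    tag_text = ", ".join(unique_tags) if unique_tags else "no recognized objects"
    return (
        f"The image may contain the following objects: {tag_text}. "
        "I am a person with blindness or low vision. "
        f"{user_query.strip()}"
    )


def describe_scene(
    image: Any,
    user_query: str,
    tagger: Callable[[Any], Sequence[str]],
    vlm: Callable[[Any, str], str],
) -> str:
    """Run the three stages: tagging, prompt construction, generation.

    `tagger` stands in for an image tagging model such as RAM and `vlm`
    for a vision-language model such as InstructBLIP; both are passed in
    as plain callables so the sketch does not assume any library API.
    """
    tags = tagger(image)
    prompt = build_prompt(tags, user_query)
    return vlm(image, prompt)
```

Passing the models in as callables keeps the prompt logic testable without loading any weights; in a real system each callable would wrap the corresponding model's inference call.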

Paper Structure (16 sections, 1 equation, 7 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Materials and Methods
Image Tagging Module
Prompt Engineering for pBLV
Vision-Language Module
Experiments
Implementation Details
Tests on Visual7W Dataset
Qualitative Performance Analysis for pBLV
Quantitative Analysis of Inference Time and Helpfulness Scoring for pBLV
Ablation Study
Tests on VizWiz Dataset
Real-World Tests
Conclusions
...and 1 more section

Figures (7)

Figure S1: Multi-Modal Foundation Model Sample Illustration.
Figure S2: Method Structure Overview.
Figure S3: Client-server architecture.
Figure S4: Examples of scene understanding (top), object localization (middle), and risk assessment (bottom) on Visual7W dataset.
Figure S5: Ablation study with different model settings on Visual7W dataset.
...and 2 more figures
