VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models

Yi Zhao; Yilin Zhang; Rong Xiang; Jing Li; Hillming Li

TL;DR

This paper defines the task of Visually Impaired Assistance with Large Models (VIALM) to assess how large models can provide environment-grounded, step-by-step guidance for visually impaired users. It offers a comprehensive survey of Large Language Models, Large Vision-Language Models, and embodied agents, and introduces a 200-sample VIALM benchmark across home and supermarket environments. Six large models, including GPT-4, are evaluated on zero-shot VIA, revealing two key gaps: limited environment grounding (25.7% of GPT-4's responses are not grounded) and insufficient fine-grained guidance (32.1% of GPT-4's responses are not fine-grained), with tactile guidance still largely lacking. The work suggests advancing visual grounding, incorporating tactile modalities, and fostering stronger multimodal synergy, while providing open-source resources to accelerate future VIA research.

Abstract

Visually Impaired Assistance (VIA) aims to automatically help the visually impaired (VI) handle daily activities. The advancement of VIA primarily depends on developments in Computer Vision (CV) and Natural Language Processing (NLP), both of which exhibit cutting-edge paradigms with large models (LMs). Furthermore, LMs have shown exceptional multimodal abilities in tackling challenging physically grounded tasks, such as those faced by embodied robots. To investigate the potential and limitations of state-of-the-art (SOTA) LMs in VIA applications, we present an extensive study of the task of VIA with LMs (VIALM). In this task, given an image illustrating the physical environment and a linguistic request from a VI user, VIALM aims to output step-by-step guidance that assists the VI user in fulfilling the request, grounded in the environment. The study consists of a survey reviewing recent LM research and benchmark experiments examining selected LMs' capabilities in VIA. The results indicate that while LMs can potentially benefit VIA, their output is often not well grounded in the environment (25.7% of GPT-4's responses) and lacks fine-grained guidance (32.1% of GPT-4's responses).

Paper Structure (30 sections, 5 figures, 6 tables)

This paper contains 30 sections, 5 figures, 6 tables.

Introduction
Large Models
Large Language Models
Pretraining Tasks.
Model Architectures.
Instruction-tuned LLMs.
Large Vision-Language Models
Visual Components.
LLM Components.
Cross-Modal Connectors.
Training Methods.
Embodied Agents
LLM Components.
Environments and Tasks.
LLM-based Agents.
...and 15 more sections

Figures (5)

Figure 1: A sample input and output of VIALM. The input pairs an image of the environment (the left image) with a user request in language (the grey box). The yellow box shows the output guidance that helps VI users complete the request within the environment (the right image). The output should convey environment-grounded (blue words) and fine-grained (green words) guidance, integrating tactile support (the last sentence) for VI users.
Figure 2: Timeline of LMs. Release times are based on the arXiv publication dates of the papers describing each model. LLMs are marked in black, large VLMs in blue, and embodied agents in green. LLM advancements were concentrated in 2021 and 2022, followed by a surge in VLM and embodied agent development in 2023.
Figure 3: Evaluation Data Sample. The test dataset is in the format of a VQA dataset. The question indicates that the VI user is searching for a toilet. The output answer offers detailed guidance related to this environment to assist the user in achieving this objective.
Figure 4: Evaluation Results. (a) The overall average automatic evaluation results for these six models: GPT-4, CogVLM, MiniGPT, Qwen-VL, LLaVA, and BLIVA, based on ROUGE and BERTScore metrics. (b) The human evaluation results of the six models focusing on the aspects of Correctness, Actionability, and Clarity.
Figure 5: An example of predictions from the top three models. GPT-4 is prone to generating redundant output describing general situations. CogVLM exhibits advantages in image grounding, while LLaVA emphasizes providing step-by-step guidance for VI users.
