WalkVLM: Aid Visually Impaired People Walking by Vision Language Model

Zhiqiang Yuan, Ting Zhang, Ying Deng, Jiapei Zhang, Yeshuang Zhu, Zexi Jia, Jie Zhou, Jinchao Zhang

TL;DR

WalkVLM introduces a large-scale Walking Awareness Dataset (WAD) and a streaming vision-language model designed for visually impaired walkers. By employing chain-of-thought-based hierarchical planning and a temporal-aware adaptive predictor, WalkVLM generates concise, timely reminders and answers QA queries with reduced temporal redundancy. The approach establishes a standardized benchmark for blind walking and outperforms existing VLMs on both quantitative and qualitative metrics. The work advances practical assistive AI by enabling more reliable, real-time guidance and opening avenues for broader research and real-world deployment.

Abstract

Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer them walking assistance. With the recent progress of vision-language models (VLMs), applying VLMs to walking guidance has become popular. However, existing walking-guidance methods are mainly based on self-curated question-answering datasets that are not publicly accessible, and there is no standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, and VLMs struggle with both because of excessive responses and low inference efficiency. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems that help visually impaired individuals walk. Furthermore, we propose WalkVLM, a model that employs chain of thought for hierarchical planning to generate concise but informative reminders and uses temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM over other VLMs in streaming video processing for this task. Our dataset and code are available at https://walkvlm2024.github.io.

Paper Structure (37 sections, 21 figures, 12 tables)

Figures (21)

  • Figure 1: WalkVLM delivers timely, concise, informative walking reminders for visually impaired individuals by utilizing hierarchical planning and temporal-aware adaptive prediction.
  • Figure 2: The data annotation pipeline for constructing the Walking Awareness Dataset (WAD). Appendix A.5 provides more examples to illustrate the diversity of the WAD dataset.
  • Figure 3: Blind test experiment for analyzing the most critical information needed by users in blind walking. Two individuals worked as a team: the participant at the rear gave directions so that the individual at the front, who had no visual information, could safely reach a specified location.
  • Figure 4: Visualization of six scenarios that require reminders, summarized from multiple blind experiments.
  • Figure 5: Visualization of the Walking Awareness Dataset. Each sample contains a video clip and multiple annotations, with the hierarchy divided into perception, comprehension, and decision.
  • ...and 16 more figures