EgoBlind: Towards Egocentric Visual Assistance for the Blind

Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao

TL;DR

EgoBlind introduces the first egocentric VideoQA dataset collected from blind individuals to benchmark multimodal LLMs on live assistive tasks. It comprises 1,392 videos and 5,311 questions across six categories, with online timestamped QA and multiple reference answers to enable robust evaluation. Sixteen open-source and closed-source MLLMs are benchmarked; even the best reach only about 60% accuracy, far below human performance (87.4%), with notable weaknesses in navigation and tool-use scenarios. The work analyzes model limitations and explores targeted prompting and finetuning strategies, positioning EgoBlind as a foundational benchmark for egocentric visual assistance for the blind. It aims to catalyze the development of AI assistants that improve independence and safety for visually impaired individuals, while providing researchers with a realistic, first-person data resource.

Abstract

We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.

Paper Structure

This paper contains 23 sections, 21 figures, 12 tables.

Figures (21)

  • Figure 1: Example from EgoBlind about a blind user demonstrating egocentric visual assistance. As she places her hands on various microwave dials, she asks a series of questions about what the dial controls, its position and settings and how to adjust it.
  • Figure 2: Data annotation pipeline.
  • Figure 3: Statistical analysis of EgoBlind. *: We omit 1,123 (22.6%) "yes/no" and 541 (11.0%) "do not know" answers in (d) and (f) for better presentation. (Please zoom in for better view.)
  • Figure 4: Common failure cases of tested MLLMs. The models fail to (a) reason about user intention, (b) understand real-time spatial orientation, and provide (c) assistive and (d) reliable answers.
  • Figure 5: Single frame inputs.
  • ...and 16 more figures