Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera

Inpyo Song, Minjun Joo, Joonhyung Kwon, Jangwon Lee

TL;DR

The paper aims to enable AI-assisted support for visually impaired people (VIPs) by introducing VIEW-QA, a VideoQA dataset built from wearable 360° egocentric video. The dataset comprises 1,030 videos (~10 hours, 1,062,960 frames) with 4,120 QA pairs across five VIP-relevant categories. The authors evaluate modern CNN-based and vision-language pretrained models on it, supplementing standard metrics with LLM-based semantic scoring. Results show that while recent models outperform simple baselines, overall performance remains below the level needed for reliable assistance, highlighting the remaining gap in understanding dynamic, real-world scenes for VIPs. This work provides a new multi-task benchmark and dataset to advance autonomous, all-around interpretation of complex surroundings for VIPs, guiding future data collection and model development to improve daily independence and safety.
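The dataset figures quoted above imply a few per-video averages that help put the scale in context. The following is a minimal sketch of that arithmetic only. The function name `dataset_stats` and the dictionary keys are hypothetical, and treating "~10 hours" as exactly 10 hours is an assumption, so the duration and frame-rate values are rough estimates rather than reported numbers.

```python
# Figures quoted in the TL;DR above.
NUM_VIDEOS = 1_030
NUM_FRAMES = 1_062_960
NUM_QA_PAIRS = 4_120
APPROX_HOURS = 10.0  # assumption: "~10 hours" treated as exactly 10 hours


def dataset_stats(num_videos, num_frames, num_qa_pairs, approx_hours):
    """Derive per-video averages from the aggregate dataset figures."""
    total_seconds = approx_hours * 3600
    return {
        # Exact ratios of the quoted counts.
        "qa_per_video": num_qa_pairs / num_videos,
        "frames_per_video": num_frames / num_videos,
        # Estimates only: they depend on the approximate total duration.
        "seconds_per_video": total_seconds / num_videos,
        "implied_fps": num_frames / total_seconds,
    }


if __name__ == "__main__":
    stats = dataset_stats(NUM_VIDEOS, NUM_FRAMES, NUM_QA_PAIRS, APPROX_HOURS)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the counts work out to 4 QA pairs and 1,032 frames per video on average, with clips of roughly 35 seconds each and an implied frame rate a little under 30 frames per second.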

Abstract

This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets: Firstly, it features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings, departing from the static image-centric nature of prior datasets. Secondly, unlike datasets centered on singular challenges, ours addresses multiple real-life obstacles simultaneously through an innovative visual-question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that while progress has been made, satisfactory performance levels for AI-powered assistive services remain elusive for visually impaired individuals. Additionally, our evaluation highlights the distinctive features of the proposed dataset, featuring ego-motion in videos captured via 360-degree cameras across varied scenarios.

Paper Structure

This paper contains 12 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: We introduce a novel visual question answering dataset comprising videos captured with a wearable 360-degree camera, aiming to address common challenges visually impaired individuals may encounter by recording entire surroundings and providing VQA-style annotations for various situations.
  • Figure 2: Overview of VIEW-QA dataset characteristics. (a) Distribution of the question types. (b) & (c) Average question and answer lengths are 5.6 and 3.2 words, respectively; the dataset is designed for VIPs, with concise questions and detailed answers addressing visual perception challenges.