Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera
Inpyo Song, Minjun Joo, Joonhyung Kwon, Jangwon Lee
TL;DR
The paper aims to enable AI-assisted support for visually impaired people (VIPs) by introducing VIEW-QA, a VideoQA dataset built from wearable 360° egocentric video. The dataset comprises 1,030 videos (~10 hours, 1,062,960 frames) with 4,120 QA pairs across five VIP-relevant categories. The authors evaluate modern CNN-based and Vision-Language Pretrained models on it, supplementing standard metrics with LLM-based semantic scoring. Results show that while recent models outperform simple baselines, overall performance remains below the level needed for reliable assistance, highlighting the remaining gap in understanding dynamic, real-world scenes for VIPs. The work provides a new multi-task benchmark and dataset to advance autonomous interpretation of the complex, all-around surroundings of VIPs, and to guide future data collection and model development aimed at improving daily independence and safety.
Abstract
This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets. First, it features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings and departing from the static, image-centric nature of prior datasets. Second, unlike datasets centered on a single challenge, ours addresses multiple real-life obstacles simultaneously through a visual question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that while progress has been made, performance remains short of the level required for AI-powered assistive services for visually impaired individuals. Our evaluation also highlights the distinctive features of the proposed dataset, notably the ego-motion in videos captured with 360-degree cameras across varied scenarios.
