GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu

TL;DR

This work tackles the challenge of enabling reliable BLV mobility support by introducing GuideDog, a 22K-image multimodal dataset (with 2K gold annotations) plus GuideDogQA for fine-grained perception, all built around three BLV-specific standards. It implements a scalable human–AI annotation pipeline that converts generation into verification, enabling large-scale, realistic egocentric data from diverse geographies. Comprehensive experiments show that direct visual perception via MLLMs and spatial reasoning are key for effective BLV guidance, with GPT-4o excelling in zero-shot guidance and domain-specific fine-tuning delivering strong gains; depth perception remains a core challenge for open-source models. The dataset and benchmarks are poised to advance BLV assistive technologies and broader egocentric reasoning applications, with code and data to be publicly released for community use.

Abstract

Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.
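As an illustration of how multiple-choice results on a benchmark like GuideDogQA could be scored per skill, here is a minimal sketch. The record layout `(category, predicted, answer)`, the category names, and the function `accuracy_by_category` are assumptions made for this example; they are not the paper's released evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Return per-category accuracy for multiple-choice predictions.

    Each record is (category, predicted_choice, correct_choice).
    This layout is hypothetical and chosen only for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, answer in records:
        total[category] += 1
        # Compare choice letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == answer.strip().upper():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy records with made-up predictions, not results from the paper.
example_records = [
    ("object_recognition", "A", "A"),
    ("object_recognition", "b", "B"),
    ("relative_depth", "A", "C"),
    ("relative_depth", "C", "C"),
]
scores = accuracy_by_category(example_records)
```

Reporting the two categories separately keeps a weakness in relative depth perception from being hidden behind strong object recognition.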

Paper Structure

This paper contains 44 sections, 20 figures, 6 tables.

Figures (20)

  • Figure 1: An overview of the accessibility-aware guidance generation task, wherein MLLMs describe the overall situation (S1), identify hazardous obstacles (S2), and summarize recommended actions for BLV individuals (S3).
  • Figure 2: Side-by-side comparison of examples from (a) GuideDog and (b) VIALM [zhao2024vialm]. GuideDog consists of real-world scenes from a pedestrian's viewpoint, while VIALM comprises home and supermarket images.
  • Figure 3: An overview of the GuideDog generation pipeline, ensuring all stages adhere to GuideDog standards (S1, S2, S3). For the collected scene image displayed at the center top, (1) in accordance with S1, the MLLM first extracts a comprehensive scene description; (2) following S2, both off-the-shelf models and the MLLM identify obstacles; (3) next, the extracted information is incorporated into a BLV-specific instruction designed to adhere to S3, generating a silver label that satisfies S1, S2, and S3; and (4) finally, human annotators assess and refine the silver labels to produce gold labels.
  • Figure 4: Visualization of the worldwide city distribution in the GuideDog dataset.
  • Figure 5: Country distribution of samples in the GuideDog dataset.
  • ...and 15 more figures
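
The four stages described in the Figure 3 caption can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the `Sample` class, the function names, and the callables `describe_scene`, `detect_obstacles`, `write_guidance`, and `review` are hypothetical stand-ins for the MLLM, the off-the-shelf models, and the human annotator. None of them come from the paper's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One image with its intermediate and final annotations (hypothetical layout)."""
    image_id: str
    scene_description: str = ""                          # stage 1, standard S1
    obstacles: List[str] = field(default_factory=list)   # stage 2, standard S2
    silver_label: str = ""                               # stage 3, machine-generated
    gold_label: Optional[str] = None                     # stage 4, human-verified


def build_silver_label(
    image_id: str,
    describe_scene: Callable[[str], str],
    detect_obstacles: Callable[[str], List[str]],
    write_guidance: Callable[[str, List[str]], str],
) -> Sample:
    """Run stages 1 to 3: describe the scene, find obstacles, draft guidance."""
    sample = Sample(image_id=image_id)
    sample.scene_description = describe_scene(image_id)
    sample.obstacles = detect_obstacles(image_id)
    sample.silver_label = write_guidance(sample.scene_description, sample.obstacles)
    return sample


def verify(sample: Sample, review: Callable[[str], Optional[str]]) -> Sample:
    """Stage 4: the reviewer returns corrected text, or None to accept as is."""
    corrected = review(sample.silver_label)
    sample.gold_label = sample.silver_label if corrected is None else corrected
    return sample
```

The point of the design is that the human step takes an existing draft as input, so the annotator verifies and refines rather than writes from scratch. A reviewer that returns `None` accepts the silver label unchanged as the gold label.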