Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired

ByungOk Han, Woo-han Yun, Beom-Su Seo, Jaehong Kim

TL;DR

This work tackles the critical challenge of spatial understanding in vision-language models for guide dog robots assisting visually impaired individuals. It introduces the SAIT dataset and SA-Bench, along with an automated data-generation pipeline that grounds descriptions in 3D space along a virtual path to a designated goal, using depth-aware path planning and region masking to improve description accuracy. Experiments with the space-aware model SA-VLM show superior performance over several state-of-the-art models, with more concise walking guidance and faster inference. The authors open-source the datasets and code, enabling broader evaluation and deployment, while acknowledging the need for human-in-the-loop data and real-world validation such as user studies and latency assessments.

Abstract

Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, an ability that is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM

Paper Structure

This paper contains 25 sections, 3 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Our space-aware instruction tuning method provides helpful and concise walking guidance to individuals with visual impairments. Given an image and a specified goal position as input, our method identifies a virtual path in 3D space and provides compact responses to five types of queries in a single inference: descriptions of the destination, the left and right sides of the path, the path itself, and a decision on whether the path is traversable, with a brief reason.
  • Figure 2: Our automatic dataset generation pipeline, which focuses on the virtual path to the destination in 3D space and the surroundings. Given an input image and a goal position, we extract object locations and generate a depth map to identify walkable areas and surrounding environments, which are then used as inputs for a VLM. Subsequently, we generate five distinct sentences through separate queries and integrate them to produce a recommendation statement on whether visually impaired individuals can traverse the specified path, with a brief reason.
  • Figure 3: An image and annotations of the SAIT dataset: queries were generated based on the goal positions, and answers were structured in XML format. Note that words like <dest_desc> are plain text in the answer, with only the image token in the query treated as a special token.
  • Figure 4: Comparisons of algorithms across different metrics for the generated 'Reco.' sentences. We chose LLM Judge as our main metric because it provides scoring based on the contextual meaning of the generated sentences.
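
The XML-structured answers described in the Figure 1 and Figure 3 captions can be read back with a few lines of code. The following is a minimal sketch, not the authors' implementation. Only the `<dest_desc>` tag is named in the captions; the other four tag names, the `parse_answer` helper, and the sample answer text are hypothetical placeholders chosen for illustration. Because the captions state that the tags are plain text rather than special tokens, the sketch uses a simple regular expression instead of a strict XML parser, which also tolerates a model output that omits a tag.

```python
import re

# Only "dest_desc" appears in the figure captions. The remaining tag
# names are hypothetical stand-ins for the other four query types
# (left side, right side, the path itself, traversability decision).
TAGS = ["dest_desc", "left_desc", "right_desc", "path_desc", "recommend"]


def parse_answer(answer: str) -> dict:
    """Return the text inside each known tag; missing tags map to None."""
    fields = {}
    for tag in TAGS:
        # Non-greedy match so that each tag captures only its own span.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", answer, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


# Made-up answer text, used only to exercise the parser.
example = (
    "<dest_desc>A crosswalk entrance.</dest_desc>"
    "<left_desc>A parked bicycle.</left_desc>"
    "<right_desc>A low hedge.</right_desc>"
    "<path_desc>A flat paved sidewalk.</path_desc>"
    "<recommend>Go. The path is clear.</recommend>"
)

parsed = parse_answer(example)
for tag, text in parsed.items():
    print(f"{tag}: {text}")
```

Returning `None` for a missing tag, rather than raising an error, lets an evaluation script score the fields that are present and flag incomplete outputs separately.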