Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing

Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim

TL;DR

This work addresses safety-aware street crossing for blind and low-vision (BLV) individuals by moving beyond traffic-signal recognition to comprehensive scene understanding using a vision-language model. It introduces a crosswalk dataset collected with a quadruped robot and leverages visual knowledge prompts (bounding boxes, segmentation masks, and optical flow) to enable GPT-4V to output a safety score and scene description in natural language. Experimental results show that temporal information supplied via optical flow improves risk assessment accuracy the most, while Chain-of-Thought prompting did not consistently help, revealing important considerations for prompt engineering in multimodal contexts. The study advances practical mobility aids for BLV individuals by demonstrating a path toward trustworthy safety reasoning, with future work focused on temporal understanding, viewpoint influence, and personalized adaptation.

Abstract

Safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context, a task that relies heavily on visual cues. Traditional methods for assisting in this decision-making process often fall short because they cannot provide a comprehensive scene analysis and safety level. This paper introduces an approach that leverages large multimodal models (LMMs) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. By generating a safety score and scene description in natural language, our method supports safe decision-making for blind and low-vision individuals. We collected crosswalk intersection data containing multiview egocentric images captured by a quadruped robot and annotated the images with safety scores based on our predefined safety score categorization. Grounded in visual knowledge extracted from the images and in text prompts, we evaluate a large multimodal model for safety score prediction and scene description. Our findings highlight the reasoning and safety score prediction capabilities of an LMM, activated by various prompts, as a pathway to developing a trustworthy system, which is crucial for applications requiring reliable decision-making support.

Paper Structure (29 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Illustrative concept of VLM evaluation pipeline for safety-aware street crossing. Visual knowledge of object detection bounding boxes, segmentation masks, and optical flow is extracted from the robot's multiview egocentric images. This is then provided to the VLM along with the text prompts. The VLM outputs the safety score and scene description.
  • Figure 2: Data collection using a quadruped robot. (a) The remote controlled robot collected RGB and depth data in crosswalk settings. One of the researchers is driving the car to simulate an unsafe scenario for street crossing. (b) Various onboard sensors enable egocentric multiview data.
  • Figure 3: Visualization of visual knowledge. (a) Raw multiview images contain the egocentric front (top patch), left (bottom left patch), bottom (bottom center patch), and right (bottom right patch) viewpoints of the robot. (b) Bounding boxes are added using an object detection algorithm. (c) Segmentation masks are added using an instance segmentation algorithm. (d) The average optical flow for each viewpoint (except for the bottom) is represented by a red arrow.
  • Figure 4: Prompt instructing the model on safety measurement
  • Figure 5: Prompt for Auto Chain-of-Thought
  • ...and 3 more figures
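
The Figure 3(d) caption describes summarising each viewpoint's optical flow as a single arrow. A minimal sketch of that averaging step follows, assuming a dense flow field of shape (H, W, 2) is already available for each viewpoint. The function names, the `scale` parameter, and the choice to draw the arrow from the patch centre are illustrative assumptions, not details taken from the paper, which does not specify here how the flow is computed or drawn.

```python
import numpy as np


def average_flow_arrow(flow, scale=1.0):
    """Summarise a dense flow field of shape (H, W, 2) as one arrow.

    Hypothetical helper: the arrow starts at the patch centre and points
    along the mean (dx, dy) displacement, multiplied by `scale`.
    Returns (start, end) as (x, y) pixel coordinates.
    """
    flow = np.asarray(flow, dtype=float)
    if flow.ndim != 3 or flow.shape[2] != 2:
        raise ValueError("expected flow of shape (H, W, 2)")
    height, width = flow.shape[:2]
    # Mean displacement over every pixel in the patch.
    mean_dx, mean_dy = flow.reshape(-1, 2).mean(axis=0)
    start = (width / 2.0, height / 2.0)
    end = (start[0] + scale * mean_dx, start[1] + scale * mean_dy)
    return start, end


def arrows_per_view(flows, skip=("bottom",), scale=1.0):
    """Compute one arrow per viewpoint, skipping the views named in `skip`.

    `flows` maps a viewpoint name (e.g. "front", "left", "right", "bottom")
    to its dense flow field. The bottom view is skipped by default, matching
    the caption's "except for the bottom".
    """
    return {
        name: average_flow_arrow(field, scale)
        for name, field in flows.items()
        if name not in skip
    }
```

The resulting start and end points could then be drawn onto each image patch as a red arrow with any drawing library before the composite image is passed to the model.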