Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing

Hochul Hwang; Sunjae Kwon; Yekyung Kim; Donghyun Kim

TL;DR

This work addresses safety-aware street crossing for blind and low-vision individuals by moving beyond traffic-signal recognition to comprehensive scene understanding using a vision-language model. It introduces a crosswalk dataset collected with a quadruped robot and leverages visual knowledge prompts (bounding boxes, segmentation masks, and optical flow) to enable GPT-4V to output a safety score and scene description in natural language. Experimental results show that temporal information via optical flow most improves risk assessment accuracy, while Chain-of-Thought prompting did not consistently help, revealing important considerations for prompt engineering in multimodal contexts. The study advances practical mobility aids for BLV individuals by demonstrating a path toward trustworthy safety reasoning, with future work focused on temporal understanding, viewpoint influence, and personalized adaptation.

Abstract

Safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context - a task heavily reliant on visual cues. Traditional methods for assisting in this decision-making process often fall short, lacking the ability to provide a comprehensive scene analysis and safety level. This paper introduces an approach that leverages large multimodal models (LMMs) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. By generating a safety score and scene description in natural language, our method supports safe decision-making for blind and low-vision individuals. We collected crosswalk intersection data containing multiview egocentric images captured by a quadruped robot and annotated the images with safety scores based on our predefined safety score categorization. Grounded in visual knowledge extracted from the images and a text prompt, we evaluate a large multimodal model for safety score prediction and scene description. Our findings highlight the reasoning and safety score prediction capabilities of an LMM, activated by various prompts, as a pathway to developing a trustworthy system, which is crucial for applications requiring reliable decision-making support.

Paper Structure (29 sections, 8 figures, 4 tables)

INTRODUCTION
RELATED WORK
Assistive Technology for Blind Street Crossing
Vision Language Models
METHOD
Visual Knowledge as Prompts
Crosswalk data collection
Safety score categorization
Visual knowledge extraction
Optical flow
Segmentation mask
Bounding box
Vision-Language Model for Risk Assessment
Instruction prompt for safety evaluation
Auto Chain-of-Thought for safety measurement
...and 14 more sections

Figures (8)

Figure 1: Illustrative concept of the VLM evaluation pipeline for safety-aware street crossing. Visual knowledge in the form of object detection bounding boxes, segmentation masks, and optical flow is extracted from the robot's multiview egocentric images. This is then provided to the VLM along with the text prompts. The VLM outputs the safety score and scene description.
Figure 2: Data collection using a quadruped robot. (a) The remote-controlled robot collected RGB and depth data in crosswalk settings. One of the researchers is driving a car to simulate an unsafe scenario for street crossing. (b) Various onboard sensors enable egocentric multiview data.
Figure 3: Visualization of visual knowledge. (a) Raw multiview images contain the egocentric front (top patch), left (bottom left patch), bottom (bottom center patch), and right (bottom right patch) viewpoints of the robot. (b) Bounding boxes are added using an object detection algorithm. (c) Segmentation masks are added using an instance segmentation algorithm. (d) The average optical flow for each viewpoint (except for the bottom) is represented by a red arrow.
Figure 4: Instruction prompt for safety measurement
Figure 5: Prompt for Auto Chain-of-Thought
...and 3 more figures
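
The Figure 3(d) caption describes the average optical flow of each viewpoint (except the bottom one) drawn as a single arrow. A minimal sketch of that averaging step is shown below, assuming a dense flow field is already available for each viewpoint as a grid of per-pixel (dx, dy) displacements. How the flow itself is estimated is not stated here, and the function names, the `scale` factor, and the data layout are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

# Assumed layout: flow[row][col] = (dx, dy) displacement in pixels.
Flow = List[List[Tuple[float, float]]]


def mean_flow(flow: Flow) -> Tuple[float, float]:
    """Average the per-pixel displacement vectors of one viewpoint."""
    count = sum(len(row) for row in flow)
    if count == 0:
        return (0.0, 0.0)
    sum_dx = sum(dx for row in flow for dx, _ in row)
    sum_dy = sum(dy for row in flow for _, dy in row)
    return (sum_dx / count, sum_dy / count)


def arrow_for_view(flow: Flow, scale: float = 10.0):
    """Return (start, end) pixel coordinates of an arrow from the patch centre.

    `scale` is a hypothetical factor that lengthens the arrow for visibility.
    """
    height = len(flow)
    width = len(flow[0]) if height else 0
    cx, cy = width / 2.0, height / 2.0
    dx, dy = mean_flow(flow)
    return (cx, cy), (cx + scale * dx, cy + scale * dy)


def summarize_views(flows: Dict[str, Flow], skip=("bottom",)):
    """Summarize mean flow per viewpoint, skipping the bottom view as in the caption."""
    summary = {}
    for name, flow in flows.items():
        if name in skip:
            continue
        dx, dy = mean_flow(flow)
        summary[name] = {
            "magnitude": math.hypot(dx, dy),
            "angle_deg": math.degrees(math.atan2(dy, dx)),
            "arrow": arrow_for_view(flow),
        }
    return summary
```

Calling `summarize_views` with a dictionary keyed by viewpoint name (for example `"front"`, `"left"`, `"right"`, `"bottom"`) yields one arrow per retained view. The start and end points could then be passed to any drawing routine to overlay a red arrow on the corresponding image patch.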
