Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing
Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim
TL;DR
This work addresses safety-aware street crossing for blind and low-vision (BLV) individuals by moving beyond traffic-signal recognition to comprehensive scene understanding with a vision-language model. It introduces a crosswalk dataset collected with a quadruped robot and uses visual knowledge prompts (bounding boxes, segmentation masks, and optical flow) to enable GPT-4V to output a safety score and a scene description in natural language. Experimental results show that temporal information supplied through optical flow improves risk assessment accuracy the most, whereas Chain-of-Thought prompting does not consistently help, which reveals important considerations for prompt engineering in multimodal contexts. The study advances practical mobility aids for BLV individuals by demonstrating a path toward trustworthy safety reasoning, with future work focused on temporal understanding, viewpoint influence, and personalized adaptation.
Abstract
Safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context, a task heavily reliant on visual cues. Traditional methods for assisting in this decision-making process often fall short, lacking the ability to provide a comprehensive scene analysis and safety level. This paper introduces an approach that leverages large multimodal models (LMMs) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. By generating a safety score and scene description in natural language, our method supports safe decision-making for blind and low-vision individuals. We collected crosswalk intersection data containing multiview egocentric images captured by a quadruped robot and annotated the images with safety scores based on our predefined safety score categorization. Grounded on visual knowledge extracted from the images and on a text prompt, we evaluate a large multimodal model for safety score prediction and scene description. Our findings highlight the reasoning and safety score prediction capabilities of an LMM, activated by various prompts, as a pathway to developing a trustworthy system, which is crucial for applications requiring reliable decision-making support.
