GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events

Xingcheng Zhou; Alois C. Knoll

TL;DR

The paper investigates GPT-4V's ability to understand complex traffic events using carefully selected keyframes from diverse incident videos, evaluating recognition, causal inference, and decision-making in a zero-shot setting. It finds that GPT-4V can achieve strong performance on classic events such as dooring, red-light running, motorcycle collisions, rollovers, and fires, but struggles with spatial reasoning, fine-grained vehicle attributes, and multi-vehicle interactions. The results highlight GPT-4V's potential for traffic incident understanding while also revealing substantial limitations that arise without additional modalities. The study suggests that incorporating acoustic signals, continuous video, and 3D spatial information, as well as improved cross-image tracking, will be needed for robust real-world applications.

Abstract

The recognition and understanding of traffic incidents, particularly traffic accidents, is of paramount importance in intelligent transportation systems and intelligent vehicles, and it has attracted sustained attention from both academia and industry. Identifying and comprehending complex traffic events is highly challenging, primarily due to the intricate nature of traffic environments, diverse observational perspectives, and the multifaceted causes of accidents. These factors have persistently impeded the development of effective solutions. The advent of large vision-language models (VLMs) such as GPT-4V has introduced innovative approaches to addressing this issue. In this paper, we explore the ability of GPT-4V on a set of representative traffic incident videos and examine the model's capacity to understand these complex traffic situations. We observe that GPT-4V demonstrates remarkable cognitive, reasoning, and decision-making ability in certain classic traffic events. We also identify limitations of GPT-4V that constrain its understanding in more intricate scenarios and that merit further exploration and resolution.
