GPT-4V Takes the Wheel: Promises and Challenges for Pedestrian Behavior Prediction

Jia Huang; Peng Jiang; Alvika Gautam; Srikanth Saripalli

TL;DR

This work evaluates GPT-4V for pedestrian behavior prediction in autonomous driving, addressing limitations of deep learning in capturing dynamic interactions and causal reasoning. By performing quantitative and qualitative assessments on JAAD and WiDEVIEW, the authors show GPT-4V achieves $57\%$ zero-shot accuracy (below the state-of-the-art $70\%$) but exhibits strong scene understanding and group-dynamics analysis, while struggling with small pedestrians and relative motion. The study highlights both the potential of Vision-Language Models to reason about complex traffic scenes and the gaps that remain before real-time deployment, including consistency across prompts and improved long-horizon prediction. The findings suggest that targeted prompt design, data augmentation, and possibly fine-tuning are needed to realize practical, safe pedestrian-prediction capabilities in urban AV systems.

Abstract

Predicting pedestrian behavior is key to ensuring the safety and reliability of autonomous vehicles. While deep learning methods have shown promise by learning from annotated video frame sequences, they often fail to fully capture the dynamic interactions between pedestrians and traffic that are crucial for accurate predictions. These models also lack nuanced common-sense reasoning. Moreover, manually annotating datasets for these models is expensive, and the models are difficult to adapt to new situations. The advent of Vision Language Models (VLMs) introduces promising alternatives to these issues, thanks to their advanced visual and causal reasoning skills. To our knowledge, this research is the first to conduct both quantitative and qualitative evaluations of VLMs in the context of pedestrian behavior prediction for autonomous driving. We evaluate GPT-4V(ision) on two publicly available pedestrian datasets: JAAD and WiDEVIEW. Our quantitative analysis focuses on GPT-4V's ability to predict pedestrian behavior in current and future frames. The model achieves 57% accuracy in a zero-shot setting, which, while impressive, still trails the state-of-the-art domain-specific models (70%) in predicting pedestrian crossing actions. Qualitatively, GPT-4V shows an impressive ability to process and interpret complex traffic scenarios, differentiate between various pedestrian behaviors, and detect and analyze groups. However, it faces challenges, such as difficulty in detecting smaller pedestrians and in assessing the relative motion between pedestrians and the ego vehicle.

Paper Structure (22 sections, 2 figures, 3 tables)

Introduction
Evaluation Method
Datasets
Quantitative Experiment Design
Qualitative Experiment Design
Results
Quantitative Experiments
Evaluation Metrics
Overall Evaluation
Comparison with State-of-the-Art Models
Current Pedestrian Behavior Recognition Frame by Frame
Future Pedestrian Behavior Prediction Frame by Frame
Future Pedestrian Behavior Prediction by Summary
Qualitative Analysis
Scene understanding
...and 7 more sections

Figures (2)

Figure 1: Accuracy of current pedestrian behavior recognition across different bounding-box area ratios.
Figure 2: Average entropy of current pedestrian behavior recognition across various pedestrian size ranges.
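
The two figure captions describe accuracy and average entropy broken down by pedestrian size. This excerpt does not say how the authors compute either quantity, so the following is only a minimal sketch under stated assumptions: each pedestrian is queried several times, entropy is the Shannon entropy of the resulting answers, and size is the bounding-box area divided by the image area. The sample layout (`box`, `image_size`, `label`, `predictions`) and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(labels):
    """Shannon entropy (bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def area_ratio(box, image_size):
    """Fraction of the image covered by box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    width, height = image_size
    return max(0, x2 - x1) * max(0, y2 - y1) / (width * height)


def majority_label(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def summarize_by_size(samples, bin_edges):
    """Per size bin, report accuracy, mean entropy, and sample count.

    Assumed (hypothetical) sample layout:
      {"box": (x1, y1, x2, y2), "image_size": (w, h),
       "label": str, "predictions": [str, ...]}
    bin_edges is an ascending list of area-ratio thresholds; bin i holds
    samples whose ratio is >= exactly i of those thresholds.
    """
    grouped = defaultdict(list)
    for sample in samples:
        ratio = area_ratio(sample["box"], sample["image_size"])
        bin_index = sum(ratio >= edge for edge in bin_edges)
        grouped[bin_index].append(sample)

    summary = {}
    for bin_index, items in sorted(grouped.items()):
        correct = sum(
            majority_label(s["predictions"]) == s["label"] for s in items
        )
        entropies = [shannon_entropy(s["predictions"]) for s in items]
        summary[bin_index] = {
            "accuracy": correct / len(items),
            "mean_entropy": sum(entropies) / len(entropies),
            "count": len(items),
        }
    return summary
```

Under these assumptions, a bin with low accuracy and high mean entropy would indicate that the model is both wrong and inconsistent for pedestrians of that size, which is the kind of pattern the abstract reports for smaller pedestrians.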
