GPT-4V Takes the Wheel: Promises and Challenges for Pedestrian Behavior Prediction
Jia Huang, Peng Jiang, Alvika Gautam, Srikanth Saripalli
TL;DR
This work evaluates GPT-4V for pedestrian behavior prediction in autonomous driving, addressing limitations of deep learning in capturing dynamic interactions and causal reasoning. By performing quantitative and qualitative assessments on JAAD and WiDEVIEW, the authors show GPT-4V achieves 57% zero-shot accuracy (below the state-of-the-art 70%) but exhibits strong scene understanding and group-dynamics analysis, while struggling with small pedestrians and relative motion. The study highlights both the potential of Vision-Language Models to reason about complex traffic scenes and the gaps that remain before real-time deployment, including consistency across prompts and improved long-horizon prediction. The findings suggest that targeted prompt design, data augmentation, and possibly fine-tuning are needed to realize practical, safe pedestrian-prediction capabilities in urban AV systems.
Abstract
Predicting pedestrian behavior is key to ensuring the safety and reliability of autonomous vehicles. While deep learning methods have shown promise by learning from annotated video frame sequences, they often fail to fully capture the dynamic interactions between pedestrians and traffic that are crucial for accurate predictions. These models also lack nuanced common-sense reasoning. Moreover, manually annotating datasets for these models is expensive, and the resulting models are difficult to adapt to new situations. The advent of Vision Language Models (VLMs) offers promising ways to address these issues, thanks to their advanced visual and causal reasoning skills. To our knowledge, this research is the first to conduct both quantitative and qualitative evaluations of VLMs in the context of pedestrian behavior prediction for autonomous driving. We evaluate GPT-4V(ision) on two publicly available pedestrian datasets: JAAD and WiDEVIEW. Our quantitative analysis focuses on GPT-4V's ability to predict pedestrian behavior in current and future frames. The model achieves 57% accuracy in a zero-shot setting, which, while impressive, still trails state-of-the-art domain-specific models (70%) in predicting pedestrian crossing actions. Qualitatively, GPT-4V shows a strong ability to process and interpret complex traffic scenarios, differentiate between various pedestrian behaviors, and detect and analyze groups. However, it faces challenges, such as difficulty in detecting smaller pedestrians and assessing the relative motion between pedestrians and the ego vehicle.
