TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Kentaro Arai, Seiji Totsuka, Hiroshi Ino, Takayuki Okatani
TL;DR
This work addresses the lack of traffic-specific spatiotemporal evaluation for multi-modal LLMs (MLLMs) by introducing TB-Bench, a benchmark with eight ego-centric perception tasks, and two vision-language instruction tuning (VLIT) datasets, TB-100k and TB-250k. It demonstrates that leading MLLMs perform poorly in zero-shot settings, while a simple baseline fine-tuned on TB-100k or TB-250k achieves 77–85% average accuracy, underscoring the datasets' efficacy. The study also shows that co-training on TB-100k together with another driving dataset can improve generalization to downstream benchmarks such as BDD-X, suggesting practical benefits for cross-domain transfer. Overall, TB-Bench provides a targeted, scalable framework to advance MLLMs in autonomous driving perception, prediction, and planning, along with datasets that facilitate robust, domain-specific fine-tuning.
Abstract
The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce two vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform on these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned on TB-100k or TB-250k, our baseline models achieve an average accuracy of up to 85%. Additionally, we demonstrate performance transfer by co-training on TB-100k together with another traffic dataset, which improves performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.
