TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos

Korawat Charoenpitaks; Van-Quang Nguyen; Masanori Suganuma; Kentaro Arai; Seiji Totsuka; Hiroshi Ino; Takayuki Okatani

TL;DR

This work tackles the gap in traffic-specific spatiotemporal evaluation for multi-modal LLMs (MLLMs) by introducing TB-Bench, a benchmark with eight ego-centric perception tasks, and two vision-language instruction tuning (VLIT) datasets, TB-100k and TB-250k. It demonstrates that leading MLLMs perform poorly in zero-shot settings, while a simple baseline fine-tuned on TB-100k or TB-250k achieves 77–85% average accuracy, underscoring the datasets' efficacy. The study also shows that co-training TB-100k with another driving dataset can improve generalization to downstream benchmarks like BDD-X, suggesting practical benefits for cross-domain transfer. Overall, TB-Bench provides a targeted, scalable framework to advance MLLMs in autonomous driving perception, prediction, and planning, along with datasets that facilitate robust, domain-specific fine-tuning.

Abstract

The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.

Paper Structure (19 sections, 3 figures, 7 tables)

Introduction
Related Work
Autonomous Driving Tasks
MLLMs and Benchmarks
Benchmark Design
Task Design
Referencing Entities
Evaluation
Generation of VQA Data
Outline
Details of the Pipeline
Baseline Framework
Experiments
Experimental Settings
Zero-shot Evaluation for MLLMs
...and 4 more sections

Figures (3)

Figure 1: Examples of four tasks from TB-Bench; additional task examples are provided in the supplementary material.
Figure 2: Overview of Data Generation Pipeline. Left: Sensory data is processed into higher-level attributes. Middle-Top: Spatial positioning and lane orientation relative to the ego-vehicle are determined. Middle-Bottom: Q&A samples are generated using rules and LLM augmentation. Right: Data is filtered and refined for the final dataset.
Figure 3: The overall architecture of our baseline framework.
