Probing Multimodal LLMs as World Models for Driving

Shiva Sreeram; Tsun-Hsuan Wang; Alaa Maalouf; Guy Rosman; Sertac Karaman; Daniela Rus

TL;DR

This experimental study assesses various MLLMs as world models using in-car camera perspectives. It finds that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding ego vehicle dynamics, interactions with other road actors, trajectory planning, and open-set scene reasoning.

Abstract

We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

Paper Structure (13 sections, 13 figures, 2 tables)

INTRODUCTION
RELATED WORK
PROBING FROM A DATA PERSPECTIVE
Providing the Means to Evaluate a Driving World Model
Scenarios directly from the real world
Scenarios by re-simulation of real-world data
EXPERIMENTAL STUDY
Ego Motion Reasoning
Other Actor Behavior Reasoning
Open-Set Reasoning
Planning Reasoning
Real versus Simulated Data
CONCLUSION

Figures (13)

Figure 1: Are MLLMs world models for driving? We investigate their effectiveness in understanding and reasoning about dynamic driving scenarios from sequential images with an introduced real-world driving and re-simulated dataset. Our experiments show that MLLMs struggle to form coherent narratives, failing to reason about car motion, traffic, etc.
Figure 2: Components of what a model must understand to be a world model for driving.
Figure 3: Heavy traffic scene provided by DriveSim converted into a grid alongside the text prompt.
Figure 4: Accelerate vs decelerate: Confusion matrices.
Figure 5: Frames of ego-motion videos of scenes provided by DriveSim.
...and 8 more figures
