A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving

Tin Stribor Sohn, Philipp Reis, Maximilian Dillitzer, Johannes Bach, Jason J. Corso, Eric Sax

TL;DR

The paper introduces a capability-driven framework to holistically evaluate multimodal large language models in autonomous driving, organizing scenario understanding into semantic, spatial, temporal, and physical dimensions with anticipation linking them. It formalizes context, modalities, and downstream tasks, and demonstrates applicability through two realistic scenarios at an urban intersection. The work surveys related literature on human driving capabilities, driving-oriented LLMs, and benchmarks, then synthesizes a unified framework and evaluation path. This framework aims to guide structured benchmarking, dataset design, and model development toward interpretable, generalizable, and safe language-guided autonomous driving systems.

Abstract

Multimodal large language models (MLLMs) hold the potential to enhance autonomous driving by combining domain-independent world knowledge with context-specific language guidance. Their integration into autonomous driving systems shows promising results in isolated proof-of-concept applications, while their performance is evaluated only on selected, isolated aspects of perception, reasoning, or planning. To leverage their full potential, a systematic framework for evaluating MLLMs in the context of autonomous driving is required. This paper proposes a holistic framework for a capability-driven evaluation of MLLMs in autonomous driving. The framework structures scenario understanding along four core capability dimensions: semantic, spatial, temporal, and physical. These dimensions are derived from the general requirements of autonomous driving systems, human driver cognition, and language-based reasoning. The framework further organises the domain into context layers, processing modalities, and downstream tasks such as language-based interaction and decision-making. To illustrate the framework's applicability, two exemplary traffic scenarios are analysed, grounding the proposed dimensions in realistic driving situations. The framework provides a foundation for the structured evaluation of MLLMs' potential for scenario understanding in autonomous driving.

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures.

Figures (5)

  • Figure 1: Four descriptive core capability dimensions of MLLMs form the basis for a capability-driven evaluation framework. The anticipation capability links all dimensions to form a holistic understanding relevant to the driving task in traffic scenarios.
  • Figure 2: Framework for a capability-driven assessment of MLLMs' scenario understanding. The environmental context leads via perceptual modalities to key capabilities of scenario description and anticipation, with underlying capability dimensions along the sense-plan-act chain, resulting in executable tasks.
  • Figure 3: Four core dimensions of the descriptive capability of MLLMs. The dimensions are separated for explicit evaluation, facilitating targeted improvements of models.
  • Figure 4: Scenario 1 shows a taxi pick-up situation at the roadside with a pedestrian crossing towards the taxi. A plastic bottle is lying in front of the ego vehicle (grey).
  • Figure 5: Scenario 2 shows a cyclist approaching a pedestrian crossing. From the perspective of the ego vehicle (grey), the cyclist is occluded by a turning yellow bus.