DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving

Enhui Ma, Jiahuan Zhang, Guantian Zheng, Tao Tang, Shengbo Eben Li, Yuhang Lu, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Zhihui Hao, Xianpeng Lang, Kaicheng Yu

TL;DR

DriveCombo is a text- and vision-based benchmark for compositional traffic rule reasoning. It introduces a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. Evaluations of 14 mainstream MLLMs show performance dropping as task complexity grows, particularly during rule conflicts.

Abstract

Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligent brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios such as traffic sign recognition, neglecting the multi-rule concurrency and conflicts that arise in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in complex real-world situations. To bridge this gap, we propose DriveCombo, a text- and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers' cognitive development, we design a systematic Five-Level Cognitive Ladder that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages. We further propose a Rule2Scene Agent that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level visual reasoning about traffic rules. Evaluations of 14 mainstream MLLMs reveal performance drops as task complexity grows, particularly during rule conflicts. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCombo in advancing compliant and intelligent autonomous driving systems.

Paper Structure (71 sections, 6 equations, 13 figures, 21 tables)

Figures (13)

  • Figure 1: (a) Existing traffic rule-focused benchmarks [wei2025driveqa, lu2025idkb] mainly assess single-rule understanding, such as recognizing traffic signs or simple right-of-way cases, resulting in flat difficulty and limited rule reasoning. In contrast, our DriveCombo introduces compositional traffic rule scenarios with leveled cognitive difficulty, enabling systematic evaluation of multimodal large language models (MLLMs) from single-rule understanding to multi-rule integration and conflict resolution. (b) Model performance across our Five-Level Cognitive Ladder. For simplicity, we present results for five representative models, which show a consistent decline in accuracy as reasoning complexity increases, especially on the Level 5 conflict-resolution task, revealing the current limitations of MLLMs in compositional traffic rule reasoning.
  • Figure 2: Task examples of the Five-Level Cognitive Ladder in DriveCombo, progressing from single-rule understanding (L1) to multi-rule integration and conflict resolution (L5). Each level presents a scenario with the corresponding traffic rule bases, visual context, and multiple-choice questions, enabling systematic evaluation of MLLMs’ compositional traffic rule reasoning under increasing cognitive complexity. The colored text in "Scenario Description" matches the colors used in "Rule Basis" and the colored boxes in the visual context, indicating the rule-to-scene alignment.
  • Figure 3: Data Distribution across (a) action types, (b) road types, (c) weather conditions, and (d) country sources in DriveCombo.
  • Figure 4: Rule2Scene Agent. The agent consists of two modules: (a) Rule Crafter performs semantic structuring of atomic rules $r_i \in R$, generates candidate rule pairs $p_i$, and verifies spatiotemporal coexistence to construct a hierarchical rule set $M$; (b) Scene Weaver converts the hierarchical rules $m_i \in M$ into textual scene descriptions $s_i$, generates structured semantic representations $w_i$, maps them to the CARLA simulator [dosovitskiy2017carla], renders and captures images $i_i$, and finally generates high-fidelity driving scenarios for model evaluation.
  • Figure 5: Performance of MLLMs on multi-rule compositional reasoning. "#Rules" denotes the number of traffic rules within each scenario. The results are obtained in a zero-shot setting. Since L1 is a single-rule setting, we present only L2-L5 here.
  • ...and 8 more figures
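
The per-level comparison described in the Figure 1(b) and Figure 5 captions amounts to grouping multiple-choice answers by ladder level and computing accuracy within each group. The following is a minimal sketch of that aggregation, not the authors' evaluation code. The function name `accuracy_by_level`, the record layout `(level, predicted_choice, correct_choice)`, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return accuracy per ladder level.

    `records` is an iterable of (level, predicted_choice, correct_choice)
    tuples. This layout is an assumption, not the benchmark's actual format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, predicted, correct in records:
        totals[level] += 1
        if predicted == correct:
            hits[level] += 1
    # Sort level labels so the result reads from L1 up to L5.
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Hypothetical toy answers, NOT results from the paper.
toy_records = [
    ("L1", "A", "A"),
    ("L1", "B", "B"),
    ("L5", "C", "A"),
    ("L5", "D", "D"),
]
per_level = accuracy_by_level(toy_records)
print(per_level)
```

Keeping the level as an explicit field in each record makes it easy to slice the same answers by other attributes, such as the number of rules per scenario, without changing the scoring logic.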