Traffic-MLLM: Curiosity-Regularized Supervised Learning for Traffic Scenario Case-Based Reasoning

Waikit Xiu, Qiang Lu, Bingchen Liu, Chen Sun, Xiying Li

TL;DR

Traffic-MLLM is a retrieval-free neural case modeling framework for multimodal traffic reasoning. It introduces a curiosity-driven refinement mechanism based on Random Network Distillation and achieves consistent improvements in dynamic reasoning, regulatory understanding, and cross-domain transfer.

Abstract

For safe and robust autonomous driving, decision-making systems must effectively leverage past experiences to handle the inherent long tail of traffic scenarios. Case-Based Reasoning (CBR) provides a natural paradigm for this by adapting solutions from prior cases. However, in complex and dynamic traffic environments, traditional CBR methods struggle to effectively abstract and adapt knowledge under uncertainty. Meanwhile, although multimodal large language models (MLLMs) exhibit strong perceptual and linguistic capabilities, their reasoning behavior often relies on empirical pattern fitting, limiting robustness under distribution shift and in long-tail scenarios. We propose Traffic-MLLM, a retrieval-free neural case modeling framework for multimodal traffic reasoning. Instead of performing explicit case retrieval at inference time, Traffic-MLLM learns a structured and generalizable case space directly during training. To support this learning process, we construct a multi-source case base by integrating dynamic traffic videos and large-scale static visual question-answering data, which serves as a unified training substrate for learning structured case representations. To further improve representation quality near knowledge boundaries, we introduce a curiosity-driven refinement mechanism based on Random Network Distillation (RND), encouraging the model to internalize cross-case structural regularities rather than surface correlations. Experiments on the SUTD-TrafficQA and DriveQA benchmarks demonstrate consistent improvements in dynamic reasoning, regulatory understanding, and cross-domain transfer. Traffic-MLLM achieves 50.8% accuracy on SUTD-TrafficQA, 74.8% on the CARLA-based DriveQA split, and 83.1% on the real-world Mapillary split, indicating that representation-level case-space refinement provides an effective alternative to explicit retrieval for scalable multimodal case adaptation.

Paper Structure

This paper contains 19 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of Traffic-MLLM. The framework unifies dynamic video reasoning and static image-based question answering through a vision-text encoder-fusion-decoder architecture. Curiosity-regularized training is applied on decoder representations without modifying the forward inference structure.
  • Figure 2: Curiosity-driven case-space optimization. Latent case representations obtained via masked pooling are evaluated by an RND module to estimate structural novelty. The resulting intrinsic signal adaptively reweights supervision, encouraging the model to allocate learning capacity toward under-represented or uncertain cases.
  • Figure 3: Representative qualitative examples of Traffic-MLLM. The model handles both dynamic traffic reasoning (future actions, counterfactual analysis, and risk prediction) and static scene understanding (traffic sign interpretation), demonstrating consistent reasoning behavior across diverse traffic scenarios.
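
The mechanism described in the Figure 2 caption (masked pooling, an RND novelty score, and novelty-based reweighting of supervision) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the paper's actual equations are not reproduced in this summary, so the class and function names (`RND`, `masked_mean_pool`, `curiosity_weights`, `weighted_loss`), the linear predictor, the `tanh` target network, and the specific weighting rule with its `beta` parameter are all assumptions chosen for illustration.

```python
import math
import random


def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1 (one vector per case)."""
    kept = [h for h, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    if not kept:
        return [0.0] * dim
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


class RND:
    """Frozen random target network plus a trainable linear predictor.

    Hypothetical minimal form of Random Network Distillation: the predictor
    is trained to match the target, so its error stays high on inputs that
    are unlike anything it has been trained on.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        # Target weights are random and never updated.
        self.target = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        # Predictor starts at zero and is the only trainable part.
        self.predictor = [[0.0] * in_dim for _ in range(out_dim)]

    def _target_out(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.target]

    def _predictor_out(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.predictor]

    def novelty(self, x):
        """Mean squared prediction error, used as the intrinsic signal."""
        t, p = self._target_out(x), self._predictor_out(x)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, x, lr=0.5):
        """One gradient step moving the predictor toward the target on x."""
        t, p = self._target_out(x), self._predictor_out(x)
        for row, pi, ti in zip(self.predictor, p, t):
            grad = 2.0 * (pi - ti) / len(t)
            for j, xj in enumerate(x):
                row[j] -= lr * grad * xj


def curiosity_weights(novelties, beta=0.5, eps=1e-8):
    """Assumed rule: weight grows with relative novelty; mean weight is 1."""
    mean_n = sum(novelties) / len(novelties)
    raw = [1.0 + beta * n / (mean_n + eps) for n in novelties]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def weighted_loss(per_sample_losses, weights):
    """Supervised loss with per-sample curiosity weights applied."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


if __name__ == "__main__":
    rnd = RND(in_dim=3, out_dim=4)
    # The third token is padding and is excluded by the mask.
    familiar = masked_mean_pool(
        [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [9.0, 9.0, 9.0]], [1, 1, 0])
    rare = [-1.0, 0.8, 0.2]
    for _ in range(200):
        rnd.update(familiar)
    scores = [rnd.novelty(familiar), rnd.novelty(rare)]
    print(curiosity_weights(scores))
```

In this sketch, a case the predictor has seen many times receives a near-zero novelty score and therefore a smaller loss weight, while an unfamiliar case keeps a higher score and a larger weight. Normalizing the weights to mean 1 is a design choice made here to keep the overall loss scale unchanged. The RND module only affects training, which is consistent with the Figure 1 caption's statement that the forward inference structure is not modified.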