Application of Multimodal Large Language Models in Autonomous Driving

Md Robiul Islam

TL;DR

The paper addresses safety and adaptability gaps in autonomous driving by leveraging Multimodal Large Language Models (MLLMs) as decision-making agents. It proposes a domain-adaptive CogVLM2 with a Visual Question Answering (VQA) dataset and a Chain-of-Thought (CoT) reasoning pipeline that partitions decisions into scene understanding, prediction, and action. Key contributions include a VQA dataset construction approach from BDD100k and KITTI with iterative annotation, a domain-adapted AD agent, and empirical evidence of improved scene interpretation, forecasting, and decision quality in Highway-env, compared to baselines. The work highlights practical benefits of MLLMs for AD, while acknowledging limitations in dynamic environments, generalization, and multilingual deployment, and outlines directions for robustness and local deployment.

Abstract

Many techniques are being applied to improve the safety, efficiency, and adaptability of Autonomous Driving (AD) systems in complex driving environments. However, AD still faces open problems, including performance limitations. To address them, we conducted an in-depth study of applying a Multimodal Large Language Model (MLLM) to AD. We constructed a Visual Question Answering (VQA) dataset to fine-tune the model and address the poor performance of MLLMs on AD. We then break the AD decision-making process down into scene understanding, prediction, and decision-making, and use Chain-of-Thought (CoT) reasoning to improve the resulting decisions. Our experiments and detailed analysis indicate how important MLLMs are for AD.
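The staged decision process described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the stage names follow the abstract, but the question wording, the `run_cot_pipeline` function, and the `ask` callable (a stand-in for any model that maps a text prompt to a text answer) are assumptions made for illustration. A real agent would also pass image input, which this sketch omits.

```python
from typing import Callable, Dict, List

# Stage order follows the abstract: scene understanding, prediction, decision.
STAGES: List[str] = ["scene_understanding", "prediction", "decision"]

# Hypothetical question wording; the paper's actual prompts are not shown here.
STAGE_QUESTIONS: Dict[str, str] = {
    "scene_understanding": "Describe the road layout, nearby vehicles, and hazards.",
    "prediction": "Given the scene, how are nearby vehicles likely to move next?",
    "decision": "Given the scene and prediction, which action should the ego vehicle take?",
}


def run_cot_pipeline(observation: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Query the model once per stage, feeding earlier answers into later prompts."""
    answers: Dict[str, str] = {}
    context = f"Observation: {observation}"
    for stage in STAGES:
        prompt = f"{context}\nQuestion ({stage}): {STAGE_QUESTIONS[stage]}"
        answers[stage] = ask(prompt)
        # Append the answer so the next stage reasons on top of it.
        context += f"\n{stage}: {answers[stage]}"
    return answers
```

Passing `ask` as a parameter keeps the chaining logic independent of any particular model, so the same function can be exercised with a stub during testing and with a fine-tuned model later.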

Paper Structure

This paper contains 19 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1:
  • Figure 2:
  • Figure 3:
  • Figure 5: The model's step-by-step thinking chain generates information in a progressive manner, leading to more interpretable results.
  • Figure 6: Visual Question Answer (VQA) dataset
  • ...and 6 more figures