RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma

TL;DR

RoboTron-Drive introduces an all-in-one large multimodal model for autonomous driving that processes images, multi-view videos, and other sensor data to perform perception, prediction, and planning. It employs a curriculum-style pre-training framework, data augmentation and standardization across six open AD datasets, and a perspective-aware prompting mechanism to achieve broad AD capabilities and strong zero-shot generalization. The approach uses a SigLIP vision encoder and a Llama-3.1-based LLM, with a four-stage training pipeline that progressively increases data and task complexity. Empirical results show state-of-the-art performance across six benchmarks and enhanced generalization to unseen datasets, highlighting the value of cross-dataset training for robust, real-world AD systems.

Abstract

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive can serve as a promising solution for AD in the real world. Project page with code: https://github.com/zhijian11/RoboTron-Drive.
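The training schedule summarized above (curriculum pre-training followed by fine-tuning on AD datasets, detailed as four stages in the Figure 3 caption) can be pictured as an ordered list of stages, each with its own data and goal. The following is only a minimal sketch of that ordering: the `Stage` class, `run_curriculum`, and the `record_stage` stand-in are hypothetical names introduced for illustration, and no real training, hyperparameters, or code from the project repository are represented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical container, not the authors' API)."""
    name: str
    data: tuple
    goal: str


# Stage descriptions paraphrase the Figure 3 caption; nothing else is assumed.
CURRICULUM = (
    Stage("stage-1", ("image-text pairs",), "language-image alignment"),
    Stage("stage-2", ("image-text pairs",), "single-image pre-training"),
    Stage(
        "stage-3",
        ("visual instruction tuning data", "perception data"),
        "visual reasoning and perception across diverse scenarios",
    ),
    Stage(
        "stage-4",
        ("six augmented and standardized AD datasets",),
        "fine-tuning for a wide range of AD tasks",
    ),
)


def run_curriculum(stages: Sequence[Stage], train_fn: Callable) -> Optional[list]:
    """Run stages strictly in order, feeding each stage's result into the next."""
    state = None
    for stage in stages:
        state = train_fn(state, stage)
    return state


def record_stage(state: Optional[list], stage: Stage) -> list:
    """Stand-in for a training step: it only records which stage ran."""
    history = [] if state is None else list(state)
    history.append(f"{stage.name}: {stage.goal}")
    return history


if __name__ == "__main__":
    for line in run_curriculum(CURRICULUM, record_stage):
        print(line)
```

The point of the sketch is the sequencing: each stage starts from the state left by the previous one, so later stages on harder data build on capabilities acquired earlier.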

Paper Structure

This paper contains 38 sections, 4 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: RoboTron-Drive achieves SOTA in both general capabilities and generalization ability. Left: RoboTron-Drive outperforms all specific SOTA models and other general large multimodal models across all 6 datasets comprising 13 tasks; Right: In zero-shot learning on unseen datasets [kim2018textual, malla2023drama, xie2025drivebench], RoboTron-Drive shows stronger generalization ability compared to models trained on individual datasets.
  • Figure 2: Overview of the RoboTron-Drive framework. We adopt the architectural form of LLaVA [liu2024visual] with a different model instantiation, processing various visual input signals. We design a perspective-aware prompt to accept multi-perspective inputs in AD scenarios. Equipped with diverse AD multimodal data, RoboTron-Drive possesses an all-in-one capability to accomplish multiple tasks in autonomous driving.
  • Figure 3: Illustration of the curriculum learning framework. Stage-1 & Stage-2: language-image alignment and single-image pre-training, which use image-text pairs to equip the LLM with a foundational capability for single-image comprehension. We refer to the combination of these two stages as image pre-training. Stage-3: we enhance the model’s visual reasoning and perception capabilities across diverse scenarios by training on both visual instruction tuning data and perception data. Stage-4: we further fine-tune the model on six augmented and standardized autonomous driving datasets, enabling it to tackle a wide range of AD tasks.
  • Figure 4: An example of how our model responds to user queries based on multi-view video inputs. Compared to specialist and public models, our model successfully understands the surrounding environment and makes accurate decisions, ultimately outputting the textual answer in the user-specified format.
  • Figure S1: Visualization of CODA-LM. Key information is highlighted in green, while errors are marked in red.
  • ...and 15 more figures
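
The Figure 2 caption mentions a perspective-aware prompt that lets the model accept multi-perspective inputs. The exact prompt format is not given on this page, so the snippet below is a purely hypothetical illustration of the general idea: each visual placeholder is labelled with the camera view it came from before the user question is appended. The function name `build_perspective_prompt`, the `<image>` placeholder token, and the view names (borrowed from nuScenes-style camera naming) are all assumptions, not the authors' implementation.

```python
from typing import Sequence


def build_perspective_prompt(
    question: str,
    views: Sequence[str],
    placeholder: str = "<image>",  # assumed placeholder token, not confirmed by the source
) -> str:
    """Prefix a question with one labelled visual placeholder per camera view."""
    if not views:
        raise ValueError("at least one view is required")
    # One line per view, so the text tells the model which camera each input belongs to.
    lines = [f"{view}: {placeholder}" for view in views]
    lines.append(question)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical view names for illustration only.
    print(
        build_perspective_prompt(
            "What should the ego vehicle do next?",
            ["CAM_FRONT", "CAM_BACK"],
        )
    )
```

Labelling each placeholder explicitly is one simple way to keep views distinguishable when several images or videos are passed in a single query; the actual mechanism in RoboTron-Drive may differ.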