OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving

Shuo Xing; Chengyuan Qian; Yuping Wang; Hongyuan Hua; Kexin Tian; Yang Zhou; Zhengzhong Tu

TL;DR

OpenEMMA lowers the barrier to research in end-to-end autonomous driving (AD) with an open-source framework that uses Multimodal Large Language Models (MLLMs) and chain-of-thought reasoning to plan trajectories from front-camera input and ego-vehicle history. A compute-efficient two-stage process first reasons about the scene and then predicts speed and curvature, which are integrated into a reachable ego path; perception is strengthened by a monocular 3D detector based on YOLO3D. Extensive experiments and qualitative visualizations on diverse nuScenes scenarios demonstrate robustness and generalizability, and the full codebase is released for community use. By pairing interpretable reasoning with external visual grounding, OpenEMMA makes end-to-end AD systems easier to study and deploy, and it highlights directions for future work in MLLM grounding and inference-time reasoning.
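To make the "speed and curvature integrated into an ego path" step concrete, here is a minimal sketch of one way such an integration could look. It is not taken from the OpenEMMA codebase: the function name `integrate_path`, the simple Euler update, and the default time step of 0.5 s are all assumptions chosen for illustration.

```python
import math


def integrate_path(speeds, curvatures, dt=0.5, x0=0.0, y0=0.0, heading0=0.0):
    """Integrate per-step speed (m/s) and curvature (1/m) into (x, y) waypoints.

    Hypothetical helper for illustration only. Assumes a fixed time step `dt`
    (seconds) and a plain Euler update; the actual OpenEMMA integration
    scheme may differ.
    """
    if len(speeds) != len(curvatures):
        raise ValueError("speeds and curvatures must have the same length")

    x, y, heading = x0, y0, heading0
    path = []
    for v, k in zip(speeds, curvatures):
        # Yaw rate is speed times curvature, so heading advances by v * k * dt.
        heading += v * k * dt
        # Move along the updated heading for one time step.
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        path.append((x, y))
    return path


if __name__ == "__main__":
    # Constant speed with a gentle left turn (positive curvature).
    for point in integrate_path([5.0] * 4, [0.02] * 4):
        print(point)
```

With zero curvature the waypoints stay on a straight line along the initial heading; a positive curvature bends the path toward positive y. Predicting speed and curvature rather than raw coordinates keeps the model's outputs in physically meaningful units, and the integration step turns them into waypoints.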

Abstract

Since their advent, Multimodal Large Language Models (MLLMs) have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, progress in developing end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advances in inference computing, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating a Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements over the baseline across a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all code at https://github.com/taco-group/OpenEMMA.
