Doe-1: Closed-Loop Autonomous Driving with Large World Model

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu

TL;DR

Doe-1 introduces a closed-loop autonomous driving framework that unifies perception, prediction, and planning as a single autoregressive process over multimodal tokens. By tokenizing observations, descriptions, and actions, it predicts the next tokens to generate future observations, descriptions, and actions conditioned on ego behavior, enabling prompt-driven visual QA, action-conditioned video generation, and end-to-end motion planning without fine-tuning. Evaluations on nuScenes demonstrate competitive performance across VQA, video generation, and planning tasks using only front-view input, highlighting the scalability and versatility of a large driving world model. The work paves the way for scalable, interpretable autonomous driving through unified token-based world modeling, while noting the current limitation of single-view inputs and the need for surround-view integration.

Abstract

End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: https://github.com/wzzheng/Doe.

Paper Structure

This paper contains 16 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Visualizations of our Doe-1 for closed-loop autonomous driving on nuScenes. We propose a large driving world model (Doe-1) to achieve unified generative closed-loop autonomous driving. We model perception, prediction, and planning as the transitions of observation$\rightarrow$description, description$\rightarrow$action, and action$\rightarrow$observation, respectively. Doe-1 accomplishes perception, planning, and prediction in a unified autoregressive generative framework and achieves closed-loop end-to-end autonomous driving for the first time.
  • Figure 2: Overview of the proposed Doe-1. We formulate autonomous driving as a unified next-token generation problem and use observation, description, and action tokens to represent each scene. Without additional fine-tuning, Doe-1 accomplishes various tasks by using different input prompts, including visual question-answering, controlled image generation, and end-to-end motion planning.
  • Figure 3: Comparisons of different paradigms. (a) The modular end-to-end model performs perception, prediction, and planning sequentially and is the most popular pipeline for autonomous driving. (b) The direct end-to-end model directly outputs the planned action given sensor inputs. (c) The LLM/VLM-based model exploits the reasoning ability of LLMs/VLMs to output actions. (d) The proposed driving world model (Doe-1) predicts the evolution between observations, descriptions, and actions to achieve closed-loop end-to-end autonomous driving.
  • Figure 4: Illustration of the proposed closed-loop autonomous driving paradigm. (a) Existing end-to-end autonomous driving methods (e.g., UniAD, GenAD) usually perform perception first and then make decisions according to the perceived descriptions. (b) Existing world models for autonomous driving (e.g., DriveDreamer, OccWorld) predict future observations based on the current actions. (c) Closed-loop autonomous driving combines the two paradigms to construct a closed loop.
  • Figure 5: Framework of the proposed Doe-1. We first re-organize the training dataset into a temporal sequence of sensor data (image), perception data (texts), and action data (position of the next frame). We then use image, text, and action tokenizers to encode them into discrete tokens to construct a 1D token sequence. We then use a transformer-based architecture to autoregressively model this sequence and use the next-token prediction as the training objective.
  • ...and 3 more figures
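
The sequence construction described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Frame` type, the `build_sequence` and `next_token_pairs` helpers, and the integer token IDs are all hypothetical stand-ins for the outputs of the image, text, and action tokenizers. The per-frame ordering (observation, then description, then action) follows the transitions named in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """One time step of already-tokenized data (hypothetical container)."""
    observation: List[int]  # image tokens
    description: List[int]  # text tokens
    action: List[int]       # action tokens (position of the next frame)


def build_sequence(frames: List[Frame]) -> List[int]:
    """Flatten frames into one 1D token sequence, in temporal order.

    Within each frame the order is observation -> description -> action,
    so the action of frame t is followed by the observation of frame t+1.
    """
    seq: List[int] = []
    for frame in frames:
        seq.extend(frame.observation)
        seq.extend(frame.description)
        seq.extend(frame.action)
    return seq


def next_token_pairs(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) for next-token prediction: targets are inputs shifted by one."""
    return seq[:-1], seq[1:]


# Toy example with made-up token IDs.
frames = [
    Frame(observation=[1, 2], description=[10], action=[20]),
    Frame(observation=[3, 4], description=[11], action=[21]),
]
seq = build_sequence(frames)
inputs, targets = next_token_pairs(seq)
```

Because every modality lives in the same sequence, the prompt determines the task: ending the prompt after the observation tokens asks for a description, ending it after the description asks for an action, and ending it after the action asks for the next observation.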