Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

TL;DR

This work argues that 2D-tokenized LLMs lack the 3D priors needed for reliable autonomous driving. It introduces Atlas, a 3D-tokenized LLM framework that uses DETR-style 3D perceptrons (StreamPETR/TopoMLP) as 3D tokenizers connected to a Vicuna LLM, enabling high-resolution multi-view perception with temporal propagation. On nuScenes, Atlas improves both 3D object detection and planning, with lower L2 error and collision rate than state-of-the-art baselines. The results support the claim that 3D priors in tokenization are crucial for robust end-to-end autonomous driving, and the work outlines avenues for future refinement and broader deployment.

Abstract

Rapid advancements in Autonomous Driving (AD) have driven a significant shift toward end-to-end approaches, particularly the use of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to combine 2D vision tokenizers with a large language model (LLM) for ego-car planning, and therefore lack the 3D geometric priors that are a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environment captioning suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which are connected to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
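
To make the "one-layer linear projector" concrete, here is a minimal, dependency-free sketch of how query embeddings from a 3D tokenizer could be mapped into an LLM's embedding space. It is not the authors' implementation: the sizes `NUM_QUERIES`, `QUERY_DIM`, and `LLM_DIM`, the helper names `make_linear` and `project`, the presence of a bias term, and the random stand-in queries are all assumptions made for illustration.

```python
import random

# Hypothetical sizes for illustration only; the real dimensions are not given here.
NUM_QUERIES = 4      # number of 3D queries emitted by the tokenizer
QUERY_DIM = 8        # embedding width of each 3D query
LLM_DIM = 16         # embedding width expected by the LLM


def make_linear(d_in, d_out, seed=0):
    """Create weights (d_out x d_in) and a zero bias for one linear layer."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out  # assumption: the projector has a bias term
    return weight, bias


def project(queries, weight, bias):
    """Apply y = W x + b to every query vector, yielding one LLM token each."""
    return [
        [sum(w * x for w, x in zip(row, query)) + b for row, b in zip(weight, bias)]
        for query in queries
    ]


# Stand-in for the 3D tokenizer output: random query embeddings.
_rng = random.Random(1)
queries = [[_rng.gauss(0.0, 1.0) for _ in range(QUERY_DIM)] for _ in range(NUM_QUERIES)]

weight, bias = make_linear(QUERY_DIM, LLM_DIM)
tokens = project(queries, weight, bias)  # NUM_QUERIES vectors of width LLM_DIM
```

Each 3D query becomes exactly one token of the LLM's width, so the number of visual tokens is set by the tokenizer's query count rather than by image resolution. In practice a layer like this would be a trainable module in a deep-learning framework rather than nested lists.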

Paper Structure

This paper contains 27 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison among end-to-end methods. (a) Modular BEV-based methods have three sequential modules for perception, prediction, and planning, but they cannot provide multiple potential trajectories or environment reasoning. (b) A 2D-tokenized VLM projects distorted 2D images into tokens, which lack the 3D priors needed for reliable autonomous driving. (c) Our 3D-tokenized LLM-based method uses 3D perceptrons as 3D tokenizers, which provide potential trajectories and rich 3D priors for reliable driving.
  • Figure 2: Brief answer format of datasets. It transforms several tasks, such as 3D object detection, map perception, environment caption, and ego-car planning, into a uniform text format. We discretize the bird's-eye view (BEV) space, spanning from -50 meters to +50 meters, into 1,000 bins.
  • Figure 3: Comparison between 2D-tokenized and our 3D-tokenized VLMs on driving captions. The 2D-tokenized VLM sometimes generates "hallucinated" descriptions, while our 3D-tokenized VLM produces accurate and comprehensive captions of the driving environment.
  • Figure 4: Qualitative results with diverse planning from Atlas. The five planning trajectories presented here are generated by running our 3D-tokenized LLM five times. Atlas is able to output different potential planning trajectories thanks to the LLM.
  • Figure 5: Our constructed question-answer pairs for VLM-based methods. It transforms several critical driving reasoning tasks, such as 3D object detection, map perception, environment caption, and ego-car planning, into a uniform text format.
  • ...and 5 more figures
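
The Figure 2 caption describes discretizing the BEV range from -50 m to +50 m into 1,000 bins, which works out to 0.1 m per bin. The sketch below shows one way such a mapping could look. Only the range and bin count come from the caption; the function names, the clamping of out-of-range values, the floor-based indexing, and decoding to bin centres are assumptions, not the paper's stated convention.

```python
import math

BEV_MIN_M = -50.0   # lower bound of the BEV range (from the Figure 2 caption)
BEV_MAX_M = 50.0    # upper bound of the BEV range (from the Figure 2 caption)
NUM_BINS = 1000     # number of bins (from the Figure 2 caption)
BIN_WIDTH_M = (BEV_MAX_M - BEV_MIN_M) / NUM_BINS  # 0.1 m per bin


def coord_to_bin(x_m: float) -> int:
    """Map a metric coordinate to an integer bin index in [0, NUM_BINS - 1]."""
    # Assumption: out-of-range values are clamped to the nearest edge bin.
    clamped = min(max(x_m, BEV_MIN_M), BEV_MAX_M)
    idx = math.floor((clamped - BEV_MIN_M) / (BEV_MAX_M - BEV_MIN_M) * NUM_BINS)
    return min(idx, NUM_BINS - 1)  # x == BEV_MAX_M falls into the last bin


def bin_to_coord(idx: int) -> float:
    """Return the metric centre of a bin (assumed decoding convention)."""
    return BEV_MIN_M + (idx + 0.5) * BIN_WIDTH_M
```

Under these assumptions, a coordinate decoded from its bin index differs from the original by at most half a bin width (0.05 m), which is the quantization error introduced by expressing positions as discrete text tokens.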