
DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving

Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang

TL;DR

DriveCode is a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. It demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.

Abstract

Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.

Paper Structure

This paper contains 26 sections, 14 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A sample procedure of DriveCode. Numbers are first extracted from text prompts and then processed by a number projector to achieve continuous number processing.
  • Figure 2: DriveCode overview. Our proposed approach consists of three parts: image projection, text tokenization and number projection. The images are first encoded by a vision tower and projected into the language embedding space via an image projector. In parallel, textual descriptions and instructions are tokenized. The third part is the main contribution of our work: continuous numerical signals are vectorized through a dedicated number projector to form aligned numerical tokens. These visual, textual, and numerical tokens are concatenated into a unified sequence and processed by an LLM for further training and inference.
  • Figure 3: Parallel autoregressive generation of text and numbers. Numbers are projected and fed into the next step without conversion to text embeddings.
  • Figure 4: Examples from the three datasets. All numbers in these datasets are replaced with <number_token>.
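
The pipeline described in Figures 1 and 4 (extract numbers from the prompt, replace them with a placeholder token, and pass the values through a number projector) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the regular expression, the `extract_numbers` helper, and the two-layer `ToyNumberProjector` with random, untrained weights are hypothetical and are not taken from the paper, which does not specify the projector architecture in this summary.

```python
import math
import random
import re

NUMBER_TOKEN = "<number_token>"
# Assumed pattern: signed integers or decimals such as "12", "-0.35", "+3.5".
NUMBER_PATTERN = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_numbers(prompt):
    """Replace each numeric literal with NUMBER_TOKEN; return the masked text and the values in order."""
    values = [float(m.group()) for m in NUMBER_PATTERN.finditer(prompt)]
    masked = NUMBER_PATTERN.sub(NUMBER_TOKEN, prompt)
    return masked, values


class ToyNumberProjector:
    """Hypothetical two-layer MLP that maps one scalar to a hidden-size vector."""

    def __init__(self, hidden_size, inner_size=16, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for parameters that would be learned in training.
        self.w1 = [rng.gauss(0.0, 1.0) for _ in range(inner_size)]
        self.b1 = [0.0] * inner_size
        self.w2 = [
            [rng.gauss(0.0, inner_size ** -0.5) for _ in range(inner_size)]
            for _ in range(hidden_size)
        ]

    def __call__(self, value):
        inner = [math.tanh(w * value + b) for w, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, inner)) for row in self.w2]


masked, values = extract_numbers("Ego speed is 12.5 m/s, steering angle is -0.35 rad.")
projector = ToyNumberProjector(hidden_size=8)
# One embedding per extracted number; each would take the place of a
# NUMBER_TOKEN position in the multimodal input sequence.
embeddings = [projector(v) for v in values]
print(masked)
print(values)
print(len(embeddings), len(embeddings[0]))
```

In a real system the hidden size would match the language model's embedding width, and the projected vectors would be interleaved with the visual and textual embeddings at the placeholder positions. The toy version only shows the data flow.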