Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models

Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, Yingcong Chen

TL;DR

This paper introduces Occ-LLM, an occupancy-based large language model for autonomous driving. To integrate occupancy grids with a language model, it uses a Motion Separation Variational Autoencoder (MS-VAE) that separates moving voxels from static ones. The MS-VAE enables efficient encoding and reconstruction of dynamic trajectories and static scenes, while a patch-based latent representation and per-frame start and end tokens turn the encoded occupancy into LLM input. Occ-LLM reports state-of-the-art performance on 4D occupancy forecasting, self-ego planning, and occupancy-based scene QA, with gains of about 6% in IoU and 4% in mIoU on 4D occupancy forecasting, reduced planning error, and better QA metrics than DriveLM. The approach highlights the practical potential of occupancy-guided reasoning for safer, more reliable autonomous driving, enabling richer interactions between perception and language-based planning.
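The motion separation step can be illustrated with a minimal sketch. It assumes the occupancy grid is an integer array of semantic class labels and that the "prior knowledge" is simply a list of class IDs that can move; the label values, `EMPTY`, `MOVABLE_CLASSES`, and both function names are hypothetical and not taken from the paper.

```python
import numpy as np

EMPTY = 0                 # assumed label for free space
MOVABLE_CLASSES = (1, 2)  # hypothetical IDs, e.g. car and pedestrian


def separate_motion(occ, movable=MOVABLE_CLASSES, empty=EMPTY):
    """Split a semantic occupancy grid into a moving grid and a static grid.

    Voxels whose class is in `movable` go to the moving grid; every other
    voxel goes to the static grid. The two grids never overlap.
    """
    moving_mask = np.isin(occ, movable)
    moving = np.where(moving_mask, occ, empty)
    static = np.where(moving_mask, empty, occ)
    return moving, static


def merge_motion(moving, static, empty=EMPTY):
    """Recombine the two grids; moving voxels take priority over static ones."""
    return np.where(moving != empty, moving, static)


# Toy 2x2x2 grid: class 3 stands in for a static class such as road.
occ = np.array([[[0, 1], [3, 3]],
                [[2, 0], [3, 0]]])
moving, static = separate_motion(occ)
restored = merge_motion(moving, static)
```

In this sketch each grid would then be passed to its own encoder, which is one way to keep the rare moving voxels from being swamped by the far more numerous static ones.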

Abstract

Large Language Models (LLMs) have made substantial advancements in the field of robotics and autonomous driving. This study presents the first Occupancy-based Large Language Model (Occ-LLM), a pioneering effort to integrate LLMs with occupancy, an important representation for driving scenes. To effectively encode occupancy as input for the LLM and address the category imbalances associated with occupancy, we propose the Motion Separation Variational Autoencoder (MS-VAE). This approach uses prior knowledge to distinguish dynamic objects from static scenes before feeding them into a tailored Variational Autoencoder (VAE). The separation enhances the model's capacity to concentrate on dynamic trajectories while effectively reconstructing static scenes. The efficacy of Occ-LLM has been validated across key tasks, including 4D occupancy forecasting, self-ego planning, and occupancy-based scene question answering. Comprehensive evaluations demonstrate that Occ-LLM significantly surpasses existing state-of-the-art methods, achieving gains of about 6% in Intersection over Union (IoU) and 4% in mean Intersection over Union (mIoU) for the task of 4D occupancy forecasting. These findings highlight the transformative potential of Occ-LLM in reshaping current paradigms within robotics and autonomous driving.
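For readers unfamiliar with the two metrics, the sketch below shows one common way they are computed for occupancy grids: IoU treats every non-empty voxel as "occupied", while mIoU averages the per-class IoU over the semantic classes. This convention, the `empty=0` label, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not given in this excerpt.

```python
import numpy as np


def occupancy_iou(pred, target, empty=0):
    """Binary IoU: occupied (non-empty) voxels versus free space."""
    p = pred != empty
    t = target != empty
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # neither grid has any occupied voxel
    return float(np.logical_and(p, t).sum() / union)


def mean_iou(pred, target, class_ids):
    """Average of per-class IoU, skipping classes absent from both grids."""
    ious = []
    for c in class_ids:
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Flattened toy grids with two semantic classes (1 and 2) and free space (0).
pred = np.array([0, 1, 1, 2, 2, 0])
target = np.array([0, 1, 2, 2, 2, 1])
iou = occupancy_iou(pred, target)               # 4 shared / 5 in union
miou = mean_iou(pred, target, class_ids=(1, 2))  # mean of 1/3 and 2/3
```

The example shows why the two numbers can differ: geometry is mostly right (IoU of 0.8), but class confusion lowers the mIoU to 0.5.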

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: We present Occ-LLM, an occupancy-based large language model designed for autonomous driving scene prediction, planning, and understanding (zoom in for the best view).
  • Figure 2: Overview of the proposed Occ-LLM framework. First, multi-view camera inputs are converted into occupancy representations using existing occupancy prediction algorithms. Next, the Motion Separation strategy distinguishes voxels belonging to moving objects from static elements. The two groups of voxels are then independently encoded into latent representations by our custom-designed VAE. Finally, these latents are processed as specified in the paper's section labeled sec:occ-pre before being fed into the LLM, completing the preparatory steps for downstream applications.
  • Figure 3: Illustration of the positional shift problem in occupancy representation. The red dotted line marks the central axis, and the red box marks an object from the subsequent frame that appears within the current frame. The problem is mitigated by adding a start token <occ> before and an end token </occ> after each frame’s latent occupancy representation.
  • Figure 4: Qualitative 4D occupancy forecasting results of our Occ-LLM. "Vanilla" refers to the direct flattening of occupancy representation and its injection into the LLM for training (zoom in for the best view).
  • Figure 5: Qualitative question-answering results of our Occ-LLM. The left panel displays the raw scene data, while the right panel shows the predicted occupancy generated by FBOCC [li2023fbocc]. The questions (Q) and the corresponding predicted answers (A) are illustrated (zoom in for the best view).
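The patch-based latent representation and the frame tokens described in the Figure 2 and Figure 3 captions can be sketched as follows. This is an illustrative sketch only: the latent shape `(C, H, W)`, the patch size, and the names `patchify` and `build_sequence` are assumptions, and the marker strings stand in for what would be learned token embeddings in a real model.

```python
import numpy as np

OCC_START, OCC_END = "<occ>", "</occ>"  # stand-ins for learned marker tokens


def patchify(latent, patch):
    """Cut a (C, H, W) latent into non-overlapping patch vectors.

    Returns an array of shape (num_patches, C * patch * patch), with
    patches ordered row by row.
    """
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "H and W must divide by patch"
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * patch * patch)


def build_sequence(frames, patch):
    """Wrap each frame's patch vectors in start and end markers.

    The markers give the model an unambiguous frame boundary, so patches
    from one frame cannot be read as belonging to its neighbour.
    """
    seq = []
    for latent in frames:
        seq.append(OCC_START)
        seq.extend(patchify(latent, patch))  # one entry per patch vector
        seq.append(OCC_END)
    return seq


# Three frames of a hypothetical 4-channel, 8x8 latent, cut into 4x4 patches.
frames = [np.zeros((4, 8, 8)) for _ in range(3)]
seq = build_sequence(frames, patch=4)

# Small check of patch ordering on a 1-channel 4x4 latent.
first_patch = patchify(np.arange(16).reshape(1, 4, 4), patch=2)[0]
```

With these sizes each frame contributes one start marker, four patch vectors, and one end marker, so the three-frame sequence has 18 entries.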