UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs

Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai

TL;DR

This work addresses the computational bottleneck of transformer attention on long autonomous driving sequences by introducing UniLION, a unified 3D backbone built on linear group RNNs whose cost grows linearly, $O(n)$, in the number of tokens (vs. $O(n^2)$ for attention). By directly concatenating multi-modal and temporal tokens, UniLION eliminates explicit fusion modules and produces a compact BEV representation shared by parallel perception, prediction, and planning heads. The authors demonstrate unified inputs, a single model that adapts across sensor configurations, and competitive or state-of-the-art results on six nuScenes tasks: 3D detection, tracking, BEV map segmentation, occupancy prediction, motion prediction, and planning. The approach offers a scalable paradigm for 3D foundation models in autonomous driving.
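
As a back-of-the-envelope illustration of the complexity gap named above, the operation counts can be compared directly. The symbols $n$ (number of tokens), $d$ (feature width), and $K$ (group size) are introduced here for illustration and are not values taken from the paper; constant factors are ignored.

$$
\underbrace{n^{2}\, d}_{\text{pairwise self-attention}}
\qquad \text{vs.} \qquad
\underbrace{\frac{n}{K} \cdot K\, d \;=\; n\, d}_{\text{linear recurrence over } n/K \text{ groups of size } K}
$$

Under these assumptions, doubling the number of tokens roughly doubles the recurrent cost but quadruples the attention cost, which is why long multi-modal and temporal token sequences favor the linear operator.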

Abstract

Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
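
To make the idea of "a linear RNN for grouped features" concrete, the following is a minimal NumPy sketch, not the authors' operator. It assumes a fixed scalar `decay` in place of learned gates, and the function name `linear_group_rnn` and the toy "LiDAR" and "camera" arrays are hypothetical. It shows only the two ingredients named in the abstract: tokens from different sources are concatenated into one sequence, and a linear recurrence runs independently inside equal-size groups.

```python
import numpy as np

def linear_group_rnn(x, group_size, decay=0.9):
    """Run a toy linear recurrence independently inside equal-size groups.

    x: (n, d) array of token features; n must be a multiple of group_size.
    decay: fixed scalar state decay (hypothetical stand-in for learned gates).
    """
    n, d = x.shape
    assert n % group_size == 0, "pad tokens so n is a multiple of group_size"
    groups = x.reshape(n // group_size, group_size, d)
    out = np.empty_like(groups)
    h = np.zeros((groups.shape[0], d), dtype=x.dtype)  # one hidden state per group
    for t in range(group_size):
        # Linear update: no nonlinearity between steps, so the total work is
        # proportional to n * d regardless of how tokens are grouped.
        h = decay * h + (1.0 - decay) * groups[:, t, :]
        out[:, t, :] = h
    return out.reshape(n, d)

# Hypothetical inputs: 6 "LiDAR" tokens and 2 "camera" tokens, concatenated
# into a single sequence with no dedicated fusion module.
lidar = np.ones((6, 4))
camera = np.zeros((2, 4))
tokens = np.concatenate([lidar, camera], axis=0)  # shape (8, 4)

y = linear_group_rnn(tokens, group_size=4)
print(y.shape)
```

In the second group the state carries information from the two "LiDAR" tokens into the two "camera" positions, which is the sense in which mixing happens inside the recurrence rather than in a separate fusion step.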

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: (a) presents the mainstream methods in implementing multi-modal fusion or temporal fusion. (b) illustrates the classic pipeline for achieving the end-to-end autonomous driving system. (c) demonstrates our method UniLION, which elegantly unifies multiple input modalities and temporal sequences into a single, versatile architecture. UniLION can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without explicit temporal or multi-modal fusion modules. Moreover, UniLION enables concurrent execution of multiple downstream tasks in a decoupled manner through a shared BEV feature representation, leveraging the comprehensive and superior feature extraction capabilities of its 3D backbone.
  • Figure 2: We propose UniLION, a unified model that performs both latent temporal fusion and multi-modal fusion inside the UniLION backbone via the linear group RNN, generating unified BEV features that serve all autonomous driving tasks, including perception, prediction, and planning. UniLION mainly consists of $N$ UniLION blocks, each paired with a voxel generation step for feature enhancement and a voxel merging step for down-sampling features along the height dimension. $(H, W, D)$ indicates the shape of the 3D feature map, where $H$, $W$, and $D$ are its length, width, and height along the X-axis, Y-axis, and Z-axis, respectively. In UniLION, we first partition the input multi-modal voxels into a series of equal-size groups. Then, we feed these grouped features into the UniLION 3D backbone to enhance their feature representation. Finally, the enhanced features are fed into a BEV backbone to generate unified BEV features for all tasks.
  • Figure 3: (a) shows the structure of UniLION block, which contains four UniLION layers, two voxel merging operations, two voxel expanding operations, and three 3D spatial feature descriptors. Here, $1\times$, $\frac{1}{2}\times$, and $\frac{1}{4}\times$ indicate the resolution of 3D feature map as $(H,W,D)$, $(H/2,W/2,D/2)$ and $(H/4,W/4,D/4)$, respectively. (b) is the illustration of voxel merging for voxel down-sampling and voxel expanding for voxel up-sampling. We use voxel merging to merge input LiDAR voxels, camera voxels, and temporal voxels to achieve multi-modal fusion and temporal fusion. (c) presents the structure of UniLION layer. (d) shows the details of the 3D spatial feature descriptor.
  • Figure 4: Illustration of the spatial information loss caused by flattening voxels into 1D sequences. For example, two voxels (indexed as 01 and 34) are adjacent in space but far apart in the 1D sequence along the X order.
  • Figure 5: The illustration of voxel generation. We first select the foreground voxels among LiDAR voxels, camera voxels, and temporal voxels and diffuse them along different directions. Then, we initialize the corresponding features of the diffused voxels as zeros and utilize the auto-regressive ability of the following UniLION block to generate diffused features.
  • ...and 1 more figures
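
The spatial information loss described in the Figure 4 caption can be reproduced with a small sketch. It assumes a 2D grid flattened row by row along X; the grid width and the two voxel coordinates are hypothetical and are not the ones drawn in the figure.

```python
def flat_index(x, y, width):
    """Position of voxel (x, y) in a 1D sequence that scans along X first."""
    return y * width + x

width = 8      # hypothetical grid width along X
a = (3, 2)     # voxel at x=3, y=2
b = (3, 3)     # its direct neighbor along Y

# Manhattan distance in the grid vs. distance in the flattened sequence.
spatial_dist = abs(a[0] - b[0]) + abs(a[1] - b[1])
seq_dist = abs(flat_index(*a, width) - flat_index(*b, width))

print(spatial_dist, seq_dist)  # 1 8
```

Neighbors along Y end up a full row apart in the sequence, so the gap grows with the grid width. A sequence model that only sees the 1D order has to recover this adjacency some other way, which is the motivation the captions give for the 3D spatial feature descriptor.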