UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai
TL;DR
This work addresses the computational bottleneck of transformer attention on long autonomous driving sequences by introducing UniLION, a unified 3D backbone built on linear group RNNs whose cost grows linearly, $O(n)$, with the number of tokens $n$ (vs. $O(n^2)$ for attention). By directly concatenating multi-modal and temporal tokens, UniLION eliminates explicit fusion modules and produces a compact BEV representation that serves perception, prediction, and planning tasks in parallel. The authors demonstrate unified inputs and a single model that adapts across sensor configurations (LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal), and report competitive to state-of-the-art results on six nuScenes tasks: 3D detection, tracking, BEV map segmentation, occupancy prediction, motion prediction, and planning. The approach offers a scalable, robust paradigm for 3D foundation models in autonomous driving.
Abstract
Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., a linear RNN applied to grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
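
To make the idea of a linear group RNN over concatenated tokens more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a toy diagonal recurrence $h_t = a\,h_{t-1} + (1-a)\,x_t$, and it forms groups from consecutive chunks of the token sequence, which is a simplification chosen only for illustration. The function name `linear_group_rnn` and the parameters `group_size` and `decay` are hypothetical and do not come from the UniLION code base.

```python
import numpy as np

def linear_group_rnn(tokens, group_size, decay=0.9):
    """Toy linear recurrence run independently inside each group of tokens.

    tokens:     (n, d) array of token features.
    group_size: tokens per group (hypothetical parameter).
    decay:      scalar recurrence weight a (hypothetical parameter).
    The work is one pass over all tokens, so cost is linear in n.
    """
    tokens = np.asarray(tokens, dtype=float)
    n, d = tokens.shape
    # Zero-pad so the sequence splits evenly into groups.
    pad = (-n) % group_size
    padded = np.concatenate([tokens, np.zeros((pad, d))], axis=0)
    groups = padded.reshape(-1, group_size, d)

    out = np.empty_like(groups)
    h = np.zeros((groups.shape[0], d))  # hidden state resets for every group
    for t in range(group_size):
        # h_t = a * h_{t-1} + (1 - a) * x_t, applied to all groups at once
        h = decay * h + (1.0 - decay) * groups[:, t, :]
        out[:, t, :] = h
    return out.reshape(-1, d)[:n]

# Tokens from different modalities are simply concatenated along the
# sequence axis; no separate fusion module is used in this sketch.
rng = np.random.default_rng(0)
lidar_tokens = rng.normal(size=(10, 4))   # stand-in for LiDAR voxel features
camera_tokens = rng.normal(size=(6, 4))   # stand-in for image features
tokens = np.concatenate([lidar_tokens, camera_tokens], axis=0)

features = linear_group_rnn(tokens, group_size=4)
print(features.shape)  # (16, 4)
```

Because the hidden state is reset at each group boundary and every token is visited once, doubling the number of tokens doubles the work, in contrast to the pairwise interactions of attention. A real model would use learned, input-dependent recurrence parameters and a spatially meaningful grouping rather than fixed chunks.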
