LION: Linear Group RNN for 3D Object Detection in Point Clouds

Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai

TL;DR

The paper tackles the challenge of modeling long-range relationships in 3D object detection from sparse point clouds, where transformers incur a quadratic computation cost. It introduces LION, a window-based backbone built on linear group RNNs that enables feature interaction within much larger groups, augmented by a 3D spatial feature descriptor to recover local spatial information and a voxel generation strategy to counter sparsity. LION generalizes across multiple linear RNN operators (e.g., Mamba, RWKV, RetNet) and achieves state-of-the-art results on Waymo, nuScenes, Argoverse V2, and ONCE, with KITTI support for quick experimentation. The approach demonstrates the viability of linear RNNs for scalable 3D perception and lays groundwork for future multi-modal or foundational models in 3D vision, albeit with some runtime considerations compared to transformers.

Abstract

The benefit of transformers in large-scale 3D point cloud perception tasks, such as 3D object detection, is limited by their quadratic computation cost when modeling long-range relationships. In contrast, linear RNNs have low computational complexity and are suitable for long-range modeling. Toward this goal, we propose a simple and effective window-based framework built on LInear grOup RNN (i.e., performing linear RNN on grouped features) for accurate 3D object detection, called LION. Its key property is that it allows sufficient feature interaction within a much larger group than transformer-based methods. However, effectively applying linear group RNN to 3D object detection in highly sparse point clouds is non-trivial because of its limited spatial modeling ability. To tackle this problem, we introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features, rather than blindly increasing the number of scanning orders for voxel features. To further address the challenge of highly sparse point clouds, we propose a 3D voxel generation strategy that densifies foreground features by exploiting the auto-regressive nature of linear group RNNs. Extensive experiments verify the effectiveness of the proposed components and the generalization of LION across different linear group RNN operators, including Mamba, RWKV, and RetNet. Furthermore, LION-Mamba achieves state-of-the-art performance on the Waymo, nuScenes, Argoverse V2, and ONCE datasets. Last but not least, our method supports various advanced linear RNN operators (e.g., RetNet, RWKV, Mamba, xLSTM, and TTT) on the small but popular KITTI dataset for a quick experience with our linear RNN-based framework.
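To make the idea of "linear RNN on grouped features" concrete, here is a minimal sketch in Python. It is not the authors' implementation and not any of the named operators (Mamba, RWKV, RetNet): it assumes scalar voxel features, a fixed hypothetical `decay` parameter, and the simplest possible linear recurrence, h_t = decay * h_{t-1} + x_t. All function names are illustrative. The point it shows is that each group is processed in a single pass whose cost grows linearly with the group size, which is what makes large groups affordable.

```python
from typing import List


def partition_into_groups(features: List[float], group_size: int) -> List[List[float]]:
    """Split a flattened voxel-feature sequence into consecutive groups.

    Assumption: features are scalars for readability; the last group may be shorter.
    """
    return [features[i:i + group_size] for i in range(0, len(features), group_size)]


def linear_rnn(group: List[float], decay: float = 0.5) -> List[float]:
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t over one group.

    One pass over the group, so the cost is linear in the group length.
    """
    outputs = []
    h = 0.0
    for x in group:
        h = decay * h + x
        outputs.append(h)
    return outputs


def linear_group_rnn(features: List[float], group_size: int, decay: float = 0.5) -> List[List[float]]:
    """Apply the toy recurrence independently inside each group."""
    return [linear_rnn(g, decay) for g in partition_into_groups(features, group_size)]


if __name__ == "__main__":
    out = linear_group_rnn([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], group_size=3)
    print(out)  # [[1.0, 2.5, 4.25], [4.0, 7.0, 9.5]]
```

Because the hidden state is reset at each group boundary, features only interact within their own group; enlarging `group_size` widens that interaction range at linear rather than quadratic cost.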

Paper Structure (21 sections, 4 equations, 8 figures, 13 tables)

This paper contains 21 sections, 4 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: (a) Comparison of different 3D backbones in terms of detection performance on the Waymo [sun2020scalability], nuScenes [caesar2020nuscenes], Argoverse V2 [wilson2023argoverse], and ONCE [mao2021one] datasets. Here, we adopt Mamba [gu2023mamba] as the default operator of our LION. Besides, we present simplified schematics of (b) DSVT [wang2023dsvt] and (c) our LION for implementing feature interaction in 3D backbones.
  • Figure 2: Illustration of LION, which mainly consists of several LION blocks, each paired with a voxel generation step for feature enhancement and a voxel merging step for down-sampling features along the height dimension. $(H, W, D)$ indicates the shape of the 3D feature map, where $H$, $W$, and $D$ are the length, width, and height of the 3D feature map along the X-axis, Y-axis, and Z-axis, respectively. $N$ is the number of LION blocks. In LION, we first convert point clouds to voxels and partition these voxels into a series of equal-size groups. Then, we feed these grouped features into the LION 3D backbone to enhance their feature representation. Finally, these enhanced features are fed into a BEV backbone and a detection head for final 3D detection.
  • Figure 3: (a) shows the structure of the LION block, which involves four LION layers, two voxel merging operations, two voxel expanding operations, and two 3D spatial feature descriptors. Here, $1\times$, $\frac{1}{2}\times$, and $\frac{1}{4}\times$ indicate 3D feature map resolutions of $(H,W,D)$, $(H/2,W/2,D/2)$, and $(H/4,W/4,D/4)$, respectively. (b) shows the process of voxel merging for voxel down-sampling and voxel expanding for voxel up-sampling. (c) presents the structure of the LION layer. (d) shows the details of the 3D spatial feature descriptor.
  • Figure 4: Illustration of the spatial information loss caused by flattening voxels into 1D sequences. For example, two voxels (indexed as 01 and 34) are adjacent in space but far apart in the 1D sequence along the X order.
  • Figure 5: The details of voxel generation. For input voxels, we first select the foreground voxels and diffuse them along different directions. Then, we initialize the corresponding features of the diffused voxels as zeros and utilize the auto-regressive ability of the following LION block to generate diffused features. Note that we do not present the voxel merging here for simplicity.
  • ...and 3 more figures
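
The captions for Figures 4 and 5 describe two mechanisms that a small sketch can make concrete: why flattening voxels into a 1D sequence separates spatial neighbors, and how voxel generation adds zero-initialized voxels around foreground ones. The Python below is an illustration under stated assumptions, not the paper's code. It uses a 2D grid instead of 3D, a hypothetical grid width of 8 (it does not reproduce the indices 01 and 34 from the figure), a foreground list supplied by the caller, and hypothetical diffusion offsets. All names are illustrative.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) on a 2D grid; the paper works in 3D


def x_order_index(coord: Coord, width: int) -> int:
    """Position of a voxel in the 1D sequence when scanning along X first."""
    x, y = coord
    return y * width + x


def sequence_gap(a: Coord, b: Coord, width: int) -> int:
    """How far apart two voxels end up in the flattened sequence."""
    return abs(x_order_index(a, width) - x_order_index(b, width))


def spatial_gap(a: Coord, b: Coord) -> int:
    """Manhattan distance between two voxels on the grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def diffuse_foreground(
    voxels: Dict[Coord, float],
    foreground: List[Coord],
    offsets: List[Coord],
) -> Dict[Coord, float]:
    """Add zero-feature voxels around each foreground voxel.

    Assumption: foreground selection happens elsewhere; existing voxels keep
    their features, and only empty positions receive a 0.0 placeholder that a
    later block would be expected to fill in.
    """
    out = dict(voxels)
    for x, y in foreground:
        for dx, dy in offsets:
            out.setdefault((x + dx, y + dy), 0.0)
    return out


if __name__ == "__main__":
    # Vertical neighbors: adjacent in space, a full row apart in the sequence.
    a, b = (1, 0), (1, 1)
    print(spatial_gap(a, b), sequence_gap(a, b, width=8))  # 1 8

    voxels = {(1, 1): 0.9, (5, 5): 0.1}
    dense = diffuse_foreground(
        voxels, foreground=[(1, 1)], offsets=[(1, 0), (-1, 0), (0, 1), (0, -1)]
    )
    print(sorted(dense))  # [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (5, 5)]
```

The first part shows the gap in the sequence growing with the grid width while the spatial distance stays at 1, which is the loss a spatial feature descriptor is meant to compensate for. The second part only creates the placeholder voxels; generating their features is left to the subsequent block, as the Figure 5 caption describes.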