Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation

Rui Yu, Runkai Zhao, Jiagen Li, Qingsong Zhao, Songhao Zhu, HuaiCheng Yan, Meng Wang

TL;DR

This work tackles the challenge of achieving real-time, accurate LiDAR-based 3D detection by addressing the heavy computational burden of Transformer-based global context modeling. It introduces FASD, a cross-model knowledge-distillation framework that transfers Transformer capabilities into a lightweight Mamba-based student using Dynamic Voxel Group, Adaptive Attention, Voxel Diffusion, and an Adapter, with latent-space and span-head distillation guiding learning. The approach yields state-of-the-art results on Waymo and nuScenes while delivering a roughly 4x reduction in resource consumption and up to 1–2% accuracy gains, illustrating the practicality of efficient sequence modeling for autonomous perception. The main contribution is a coherent pipeline that preserves global context and improves spatial understanding in a resource-constrained model, supporting real-time LiDAR perception for autonomous driving and robotics.

Abstract

A LiDAR-based 3D object detector that balances accuracy and speed is crucial for real-time perception in autonomous driving and robotic navigation systems. Integrating global context for visual understanding improves a point cloud detector's grasp of overall spatial information, and with it the detection accuracy. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively unifying cross-model voxel features. We aim to distill the Transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba model through latent-space feature supervision and span-head distillation, resulting in an efficient student model with improved performance. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over current SoTA methods.
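
The latent-space feature supervision mentioned in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the adapter is assumed to be a single linear projection, the loss is assumed to be mean squared error, and the names `adapt` and `feature_distill_loss` are hypothetical. The abstract itself does not specify the adapter's form or the loss.

```python
from typing import List

Matrix = List[List[float]]


def adapt(features: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Hypothetical linear adapter (an assumption, not the paper's design).

    Projects each student voxel feature to the teacher's channel width.
    `weight` has one row per output channel, each of the student's width.
    """
    return [
        [sum(x * w for x, w in zip(row, out_row)) + b
         for out_row, b in zip(weight, bias)]
        for row in features
    ]


def feature_distill_loss(student: Matrix, teacher: Matrix) -> float:
    """Mean squared error over all voxels and channels (assumed loss).

    The teacher features act as a fixed target, mirroring a frozen teacher.
    """
    total = 0.0
    count = 0
    for s_row, t_row in zip(student, teacher):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


# Two voxels: student features have 2 channels, teacher features have 3.
student_feats = [[1.0, 2.0], [0.0, 1.0]]
teacher_feats = [[1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]
adapter_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adapter_bias = [0.0, 0.0, 0.0]

aligned = adapt(student_feats, adapter_weight, adapter_bias)
loss = feature_distill_loss(aligned, teacher_feats)
```

In a training loop this loss would be minimized with respect to the student and adapter parameters only, since the teacher is frozen.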

Paper Structure (17 sections, 14 equations, 7 figures, 8 tables)


Figures (7)

  • Figure 1: Performance comparison of existing LiDAR 3D detection models. With our proposed FASD framework, the Mamba-based student model achieves SoTA performance on all metrics of the Waymo and nuScenes validation sets. For simplicity, we omit the performance of the Transformer-based student model here.
  • Figure 2: Illustration of how the FLOPs of Transformer, Linformer, and Mamba change with respect to Batch Size, Sequence Length, and Parameters.
  • Figure 3: Overview of our proposed FASD pipeline. FASD can be divided into the Transformer-based Teacher Model, the Mamba-based Student Model, and the Knowledge Distillation. The frozen teacher model mentors the student model by providing a comprehensive guide for learning both global visual context and detailed local spatial features.
  • Figure 4: The basic steps of the overall FASD process begin by dividing the 3D voxel space into $N$ groups. These groups are then sequentially expanded into a long sequence and passed to both the teacher and student models. Knowledge transfer between the models is achieved through an adapter.
  • Figure 5: Visualization of heterogeneous model features shows that Transformers capture more pronounced global geometry. Therefore, distillation is required to address the issues in Mamba.
  • ...and 2 more figures
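
The trend that Figure 2 describes can be approximated with a leading-order cost model. The sketch below is a rough illustration, not the paper's measurement: it counts only the dominant sequence-mixing term of each architecture, ignores the linear projections and the feed-forward layers, and uses hypothetical function names and example sizes. It relies only on the well-known scaling behavior: self-attention grows quadratically with sequence length, while Linformer and a state-space scan grow linearly.

```python
def attention_flops(seq_len: int, dim: int) -> int:
    # QK^T and the attention-weighted sum over V: two (L x L x d) products,
    # counted at 2 FLOPs per multiply-add.
    return 2 * 2 * seq_len * seq_len * dim


def linformer_flops(seq_len: int, dim: int, proj_len: int) -> int:
    # Keys and values are projected to a fixed length k, so the two
    # products become (L x k x d).
    return 2 * 2 * seq_len * proj_len * dim


def ssm_scan_flops(seq_len: int, dim: int, state: int) -> int:
    # A state-space scan updates and reads a d x N state once per token
    # (rough estimate; the constant factor is an assumption).
    return 2 * 2 * seq_len * dim * state


# Illustrative sizes only (assumed, not taken from the paper).
dim, proj_len, state = 64, 256, 16
for seq_len in (1024, 2048, 4096):
    print(
        seq_len,
        attention_flops(seq_len, dim),
        linformer_flops(seq_len, dim, proj_len),
        ssm_scan_flops(seq_len, dim, state),
    )
```

Doubling the sequence length quadruples the attention estimate but only doubles the other two, which is the gap that motivates a Mamba-based student for long voxel sequences.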