Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences

Mellon M. Zhang, Glen Chou, Saibal Mukhopadhyay

TL;DR

This work addresses the latency-accuracy trade-off in LiDAR-based 3D detection by introducing Polar-Fast-Cartesian-Full (PFCF), a hybrid detector that combines a fast polar streaming backbone with a lightweight Cartesian full-scan backbone through a Sector Feature Buffer (SFB). Central to PFCF is Polar Hierarchical Mamba (PHiM), a polar-native state-space backbone that uses dimensionally-decomposed convolutions to mitigate polar distortion while preserving streaming efficiency. The approach achieves a new Pareto frontier on the Waymo Open dataset, surpassing prior streaming methods by about 10% mAP and matching full-scan accuracy at roughly twice the update rate, with strong generalization to nuScenes. The combination of SFB-based cross-sector fusion, PHiM's temporal-spatial modeling, and distortion-aware feature learning enables full-scene predictions on streaming inputs, offering practical benefits for real-time autonomous driving perception.

Abstract

Accurate and low-latency 3D object detection is essential for autonomous driving, where safety hinges on both rapid response and reliable perception. While rotating LiDAR sensors are widely adopted for their robustness and fidelity, current detectors face a trade-off: streaming methods process partial polar sectors on the fly for fast updates but suffer from limited visibility, cross-sector dependencies, and distortions from retrofitted Cartesian designs, whereas full-scan methods achieve higher accuracy but are bottlenecked by the inherent latency of a LiDAR revolution. We propose Polar-Fast-Cartesian-Full (PFCF), a hybrid detector that combines fast polar processing for intra-sector feature extraction with accurate Cartesian reasoning for full-scene understanding. Central to PFCF is a custom Mamba SSM-based streaming backbone with dimensionally-decomposed convolutions that avoids distortion-heavy planes, enabling parameter-efficient, translation-invariant, and distortion-robust polar representation learning. Local sector features are extracted via this backbone, then accumulated into a sector feature buffer to enable efficient inter-sector communication through a full-scan backbone. PFCF establishes a new Pareto frontier on the Waymo Open dataset, surpassing prior streaming baselines by 10% mAP and matching full-scan accuracy at twice the update rate. Code is available at https://github.com/meilongzhang/Polar-Hierarchical-Mamba.
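
To make the buffering idea in the abstract concrete, the following is a minimal sketch of how a sector feature buffer could accumulate per-sector features and expose a stitched full-scan feature map. It is an illustration under stated assumptions, not the authors' implementation: the class name `SectorFeatureBuffer`, its methods, the tensor layout (channels, radial bins, azimuth bins), and the zero initialization are all hypothetical.

```python
import numpy as np


class SectorFeatureBuffer:
    """Hypothetical buffer that keeps the latest feature map of each polar sector."""

    def __init__(self, num_sectors, channels, radial_bins, azimuth_bins_per_sector):
        self.num_sectors = num_sectors
        # One slot per sector; slots start at zero until their sector arrives.
        self.slots = np.zeros(
            (num_sectors, channels, radial_bins, azimuth_bins_per_sector),
            dtype=np.float32,
        )

    def update(self, sector_index, features):
        # Overwrite only the slot of the sector that was just processed;
        # the index wraps around once a full revolution has been seen.
        self.slots[sector_index % self.num_sectors] = features

    def stitch(self):
        # Concatenate all slots along the azimuth axis to form a
        # full-revolution polar feature map of shape
        # (channels, radial_bins, num_sectors * azimuth_bins_per_sector).
        return np.concatenate(list(self.slots), axis=-1)


# Usage: four quarter-scan sectors, each with a 2 x 3 x 5 feature map.
buffer = SectorFeatureBuffer(num_sectors=4, channels=2, radial_bins=3,
                             azimuth_bins_per_sector=5)
buffer.update(1, np.ones((2, 3, 5), dtype=np.float32))
full_map = buffer.stitch()
```

In this sketch every new sector refreshes one slot while the other slots keep features from earlier sectors, so a full-scene map is available after each partial update. How stale slots are aligned or motion-compensated is not covered here.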

Paper Structure

This paper contains 39 sections, 12 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of Perception Paradigms. With the introduction of a sector feature buffer, PFCF combines the rapid updates and low peak memory utilization of streaming methods with the accuracy of full-scan methods. Solid blue lines show accuracy; dashed red lines show memory usage. The $F$ values are inputs, the $P$ values are predictions, $\delta s$ is the sensor latency, and $\delta m$ is the model latency.
  • Figure 2: Comparison with existing streaming works. (a) As opposed to full-scan methods that aggregate points into complete point clouds from full rotations of the LiDAR sensor, streaming methods operate on partial point cloud sectors as they are collected. Most methods make partial predictions based on these partial input sectors. (b) Prior streaming approaches include models operating on Cartesian coordinates using RNNs or convolutions (top), and those using polar coordinates with convolutions or attention mechanisms (bottom). These methods rely on combinations of stateful memory, post-processing, or auxiliary modalities to model spatiotemporal correlations between new and past sectors. (c) PHiM encodes spatiotemporal interactions directly into the hidden state of a state-space model (SSM), while storing sector-level features. This enables full-scene predictions from individual sectors and requires no additional modalities, a priori context padding, or post-processing.
  • Figure 3: Polar-Fast-Cartesian-Full (PFCF) Pipeline. PFCF processes partial input sectors independently through a polar streaming backbone consisting of stacked Polar Hierarchical Mamba (PHiM) blocks (Fig. 4). These PHiM blocks aggregate local features, encode sector-level context, and propagate information forward in time. Features are then stitched together with the buffered features from previous sectors, projected into the Cartesian 2D BEV space, and refined by a lightweight BEV backbone before detection via a CenterPoint head. In this way, PFCF can ingest partial sectors yet output updated predictions over the entire scene at every timestep.
  • Figure 4: PHiM Block. Serialization is according to the azimuth angle, and the bidirectional local SSM is aggregated with an elementwise addition.
  • Figure 5: Performance and speed comparison on Waymo Open. PFCF offers the speed and throughput benefits of streaming methods with the competitive performance of full-scan methods. For streaming methods, $\frac{1}{N}$ denotes the size of each partial sector -- for example, $\frac{1}{4}$ means each sector is one quarter of a full point cloud. (Left) Throughput is measured end-to-end including sensor latency assuming simultaneous sensing and perception. (Right) Inference speed is measured end-to-end without sensor latency on batch size 1.
  • ...and 5 more figures
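
The Figure 4 caption mentions two mechanics that a short sketch can illustrate: serializing features by azimuth angle, and combining a forward and a backward scan by elementwise addition. The code below is a toy illustration only. It replaces the Mamba selective SSM with a fixed-decay linear recurrence, and the function names, the `decay` parameter, and the use of 2D point coordinates are assumptions made for the example.

```python
import numpy as np


def serialize_by_azimuth(points_xy, features):
    """Order per-point features by azimuth angle (assumed serialization key)."""
    azimuth = np.arctan2(points_xy[:, 1], points_xy[:, 0])
    order = np.argsort(azimuth, kind="stable")
    return order, features[order]


def linear_scan(sequence, decay=0.9):
    """Toy stand-in for an SSM scan: state_t = decay * state_{t-1} + x_t."""
    sequence = np.asarray(sequence, dtype=np.float64)
    outputs = np.zeros_like(sequence)
    state = np.zeros(sequence.shape[1], dtype=np.float64)
    for t in range(sequence.shape[0]):
        state = decay * state + sequence[t]
        outputs[t] = state
    return outputs


def bidirectional_scan(sequence, decay=0.9):
    """Run the scan in both directions and merge by elementwise addition."""
    sequence = np.asarray(sequence, dtype=np.float64)
    forward = linear_scan(sequence, decay)
    # Scan the reversed sequence, then flip back so positions line up.
    backward = linear_scan(sequence[::-1], decay)[::-1]
    return forward + backward


# Usage: three points at azimuths pi/2, 0, and -pi/2 with scalar features.
points = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0]])
feats = np.array([[4.0], [2.0], [1.0]])
order, serialized = serialize_by_azimuth(points, feats)
merged = bidirectional_scan(serialized, decay=0.5)
```

With these inputs the serialized features are ordered 1, 2, 4 (increasing azimuth), and each merged value sums the context from both scan directions at that position.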