Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences

Rui Yu, Runkai Zhao, Cong Nie, Heng Wang, HuaiCheng Yan, Meng Wang

TL;DR

LiDAR-based 3D object detection in autonomous driving suffers from data sparsity, occlusion, and limited information transfer across frames. The authors propose LiSTM, a motion-aware temporal fusion framework that operates on bird's-eye-view (BEV) representations. It uses a Kalman-filter-based motion prior to generate a motion heatmap and guides feature aggregation with Motion-Guided Feature Aggregation (MGFA), a Dual Correlation Weighting Module (DCWM), and a Motion Transformer. The method models a 10-dimensional state $(x,y,z,\theta,l,w,h,\dot{x},\dot{y},\dot{z})$ and uses Gaussian heatmaps $N_{t-1}^{t}(\mu_k,\sigma_k^{2})$ and $N_{t+1}^{t}(\mu_k,\sigma_k^{2})$ to encode forward and backward motion information for cross-frame fusion. Experiments on Waymo and nuScenes show that LiSTM outperforms CenterPoint and other baselines in vehicle, pedestrian, and cyclist detection, with notably improved long-distance perception while maintaining computational efficiency.
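As a rough illustration of the motion prior described above, the prediction step of a Kalman filter over the 10-dimensional state can be sketched as follows. This is a minimal sketch, not the authors' implementation: the constant-velocity transition, the isotropic process noise `q`, the function names, and the use of a negative time step for backward extrapolation are all assumptions made for the example.

```python
import numpy as np

STATE_DIM = 10  # (x, y, z, theta, l, w, h, vx, vy, vz)


def transition_matrix(dt: float) -> np.ndarray:
    """Constant-velocity model (assumed): position += velocity * dt."""
    F = np.eye(STATE_DIM)
    F[0, 7] = dt  # x depends on vx
    F[1, 8] = dt  # y depends on vy
    F[2, 9] = dt  # z depends on vz
    return F


def kalman_predict(x: np.ndarray, P: np.ndarray, dt: float, q: float = 0.01):
    """Kalman prediction step: propagate the state mean and covariance by dt."""
    F = transition_matrix(dt)
    Q = q * np.eye(STATE_DIM)  # hypothetical isotropic process noise
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


# One object: centre, heading, box size, then velocity (illustrative values).
state = np.array([10.0, 5.0, 0.5, 0.1, 4.5, 1.9, 1.6, 2.0, -1.0, 0.0])
cov = np.eye(STATE_DIM)

# Forward extrapolation by one assumed frame interval of 0.1 s.
forward_state, forward_cov = kalman_predict(state, cov, dt=0.1)
# Backward extrapolation, approximated here by a negative time step.
backward_state, backward_cov = kalman_predict(state, cov, dt=-0.1)
```

Only the box centre moves under this model; heading and box size are carried over unchanged, and the covariance grows in both directions, which is what a heatmap variance could be derived from.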

Abstract

Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly at long range and under occlusion. Recently, temporal aggregation has been shown to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to use the object trajectory from previous and future motion states to encode spatial-temporal correlations as a Gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. Finally, a cascaded cross-attention-based decoder is employed to refine the 3D prediction. Experiments on the Waymo and nuScenes datasets demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.
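To make the idea of a motion-based heatmap guiding temporal fusion more concrete, the sketch below renders a Gaussian heatmap on a BEV grid and uses it to weight features from a neighboring frame. It is only an illustration under stated assumptions: the grid size, the per-object `sigma`, the max-merge of overlapping Gaussians, and the additive weighting in `heatmap_guided_fusion` are hypothetical choices, not the MGFA design from the paper.

```python
import numpy as np


def gaussian_heatmap(centers, sigmas, shape):
    """Render one 2D Gaussian per object on a BEV grid; overlaps keep the max."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros(shape, dtype=np.float64)
    for (cx, cy), sigma in zip(centers, sigmas):
        sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
        heatmap = np.maximum(heatmap, np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return heatmap


def heatmap_guided_fusion(current, neighbor, heatmap):
    """Add neighbor-frame features to the current frame, weighted per BEV cell.

    current, neighbor: arrays of shape (channels, height, width).
    heatmap: array of shape (height, width) with values in [0, 1].
    """
    return current + heatmap[None, :, :] * neighbor


# One predicted object centre at column 8, row 6 of a 16 x 16 grid (illustrative).
hm = gaussian_heatmap(centers=[(8, 6)], sigmas=[2.0], shape=(16, 16))

cur_feat = np.zeros((4, 16, 16))  # stand-in for current-frame BEV features
nbr_feat = np.ones((4, 16, 16))   # stand-in for neighbor-frame BEV features
fused = heatmap_guided_fusion(cur_feat, nbr_feat, hm)
```

In this toy setup the neighbor-frame features contribute fully at the predicted object centre and fade out with distance from it, so cells far from any predicted object are left almost unchanged.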

Paper Structure

This paper contains 8 sections, 12 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Unlike the global bird's-eye-view (BEV) Neighbor Feature Fusion Method (a) and the Trajectory-based Method (b), which do not account for the role of future states, we propose a novel LiDAR 3D object detection framework that uses motion forecasting to guide temporal fusion learning across past and future frames, as shown in (c).
  • Figure 2: Overview of our proposed framework LiSTM. It processes multi-frame point clouds by performing voxelization before feeding them into the LiDAR BEV encoder. The first module employs a single-stage detector combined with tracking prediction to produce trajectories and then enhances the spatial representation with a Motion-Guided Feature Aggregation Module. The second module is used for cross-frame feature extraction by the proposed Dual Correlation Weighting Module and Motion Transformer.
  • Figure 3: Motion Guided Feature Aggregation.
  • Figure 4: Dual Correlation Weighting Module.
  • Figure 5: Qualitative visualization of our LiSTM on the Waymo validation set. We show the 3D box predictions in the LiDAR bird's-eye-view