Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences
Rui Yu, Runkai Zhao, Cong Nie, Heng Wang, HuaiCheng Yan, Meng Wang
TL;DR
LiDAR-based 3D object detection in autonomous driving suffers from data sparsity, occlusion, and limited information transfer across frames. The authors propose LiSTM, a motion-aware temporal fusion framework that operates on bird's-eye-view (BEV) representations. It uses a Kalman-filter-based motion prior to generate a motion heatmap, which guides feature aggregation through Motion-Guided Feature Aggregation (MGFA), a Dual Correlation Weighting Module (DCWM), and a Motion Transformer. The method models a $10$-dimensional state $(x,y,z,\theta,l,w,h,\dot{x},\dot{y},\dot{z})$ and uses Gaussian heatmaps $N_{t-1}^{t}(\mu_k,\sigma_k^{2})$ and $N_{t+1}^{t}(\mu_k,\sigma_k^{2})$ to encode forward and backward motion information for cross-frame fusion. Experiments on Waymo and nuScenes show that LiSTM outperforms CenterPoint and other baselines in vehicle, pedestrian, and cyclist detection, with notably improved long-distance perception while maintaining computational efficiency.
Abstract
Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly at long range and under occlusion. Recently, temporal aggregation has been shown to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to use the object trajectory from previous and future motion states to encode spatial-temporal correlations as a Gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. Finally, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.
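To make the motion prior concrete, the sketch below shows one way a non-learnable estimate could be turned into a BEV heatmap. It is an illustration under stated assumptions, not the authors' implementation: it uses only the constant-velocity mean prediction of a Kalman filter (no covariance propagation or measurement update), and the function names, grid layout, cell size, and the value of sigma are all hypothetical choices made for this example.

```python
import math

# Assumed state layout, following the 10-D state in the summary:
# (x, y, z, theta, l, w, h, vx, vy, vz)

def predict_constant_velocity(state, dt):
    """Shift a box state by dt seconds under constant velocity.

    A positive dt carries frame t-1 forward to t; a negative dt carries
    frame t+1 backward to t. Only the mean is propagated here.
    """
    x, y, z, theta, l, w, h, vx, vy, vz = state
    return (x + vx * dt, y + vy * dt, z + vz * dt, theta, l, w, h, vx, vy, vz)

def gaussian_heatmap(centers, sigmas, grid_size, cell_m, origin_m):
    """Render isotropic Gaussians at BEV centers on a square grid.

    centers  : list of (x, y) positions in metres
    sigmas   : one standard deviation per center, in metres (assumed values)
    origin_m : (x, y) of the grid corner in metres
    Overlapping objects are merged with an element-wise maximum.
    """
    heat = [[0.0] * grid_size for _ in range(grid_size)]
    for (cx, cy), sigma in zip(centers, sigmas):
        for row in range(grid_size):
            for col in range(grid_size):
                # Metric coordinates of this cell's center.
                px = origin_m[0] + (col + 0.5) * cell_m
                py = origin_m[1] + (row + 0.5) * cell_m
                d2 = (px - cx) ** 2 + (py - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                heat[row][col] = max(heat[row][col], value)
    return heat

# Hypothetical vehicle observed at t-1, moving at 5 m/s along x.
prev_state = (10.0, 2.5, 0.0, 0.0, 4.5, 1.8, 1.5, 5.0, 0.0, 0.0)
forward = predict_constant_velocity(prev_state, dt=0.1)

# Forward-motion heatmap on an 8 x 8 grid of 1 m cells.
heat = gaussian_heatmap(
    centers=[(forward[0], forward[1])],
    sigmas=[1.0],
    grid_size=8,
    cell_m=1.0,
    origin_m=(8.0, 0.0),
)
```

Calling `predict_constant_velocity` with a negative `dt` on a state from frame $t+1$ would give the backward counterpart, so the same rendering routine could serve both $N_{t-1}^{t}$ and $N_{t+1}^{t}$ in this simplified setting.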
