Sparse4D v3: Advancing End-to-End 3D Detection and Tracking
Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, Zhizhong Su
TL;DR
Sparse4D-v3 advances end-to-end 3D perception by introducing Temporal Instance Denoising, Quality Estimation, and Decoupled Attention to stabilize decoder training and improve detection quality. It also extends the detector to end-to-end 3D tracking in a simple way: temporal instances are treated as trajectories and assigned IDs, with no change to training. Experiments on nuScenes show consistent gains in mAP, NDS, and AMOTA, with state-of-the-art tracking metrics and robust performance across backbones and image resolutions. Cloud-based enhancements using multi-frame feature fusion and a stronger backbone further improve detection and tracking performance, underscoring Sparse4D-v3’s practical potential for autonomous driving systems.
Abstract
In autonomous driving perception systems, 3D detection and tracking are two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention as a structural improvement, leading to significant gains in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we observe improvements of 3.0\%, 2.2\%, and 7.6\% in mAP, NDS, and AMOTA, reaching 46.9\%, 56.1\%, and 49.0\%, respectively. Our best model achieves 71.9\% NDS and 67.7\% AMOTA on the nuScenes test set. Code will be released at \url{https://github.com/linxuewu/Sparse4D}.
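To make the inference-time ID assignment concrete, the following is a minimal sketch and not the released implementation. It assumes that instances carried over from the previous frame keep their ID, and that a new instance receives a fresh ID once its confidence passes a threshold. The `Instance` class, the `assign_ids` function, and the threshold value of 0.5 are hypothetical names and settings chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Instance:
    score: float                    # detection confidence in the current frame
    track_id: Optional[int] = None  # None until an ID has been assigned


def assign_ids(instances: List[Instance], next_id: Iterator[int],
               threshold: float = 0.5) -> List[Instance]:
    """Return confident instances, giving a fresh ID to any that lack one.

    The threshold is an assumed value, not one taken from the paper.
    """
    tracked = []
    for inst in instances:
        if inst.score < threshold:
            continue  # low-confidence instances are not reported as tracks
        if inst.track_id is None:
            inst.track_id = next(next_id)  # start a new trajectory
        tracked.append(inst)
    return tracked


ids = count()  # shared ID generator across frames, starting at 0

# Frame 1: all instances are new, so confident ones receive new IDs.
frame1 = [Instance(0.9), Instance(0.2), Instance(0.7)]
out1 = assign_ids(frame1, ids)

# Frame 2: the first instance is a temporal instance propagated from frame 1
# and keeps its ID; the second is newly detected and gets a fresh ID.
frame2 = [Instance(0.8, track_id=out1[0].track_id), Instance(0.6)]
out2 = assign_ids(frame2, ids)
```

Because the ID travels with the propagated instance, this sketch needs no separate data-association step between frames, which is the property of query-based detectors that the abstract highlights.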
