Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, Zhizhong Su

TL;DR

Sparse4D-v3 advances end-to-end 3D perception by introducing Temporal Instance Denoising, Quality Estimation, and Decoupled Attention to stabilize decoder training and improve detection quality. It also extends the detector to end-to-end 3D tracking in a simple yet effective way: temporal instances are treated as trajectories and assigned IDs, with no change to training. Experiments on nuScenes show consistent gains in mAP, NDS, and AMOTA, with state-of-the-art tracking metrics and robust performance across backbones and image resolutions. In a cloud-based setting, multi-frame feature fusion and a stronger backbone improve detection and tracking performance further, underscoring Sparse4D-v3’s practical potential for autonomous driving systems.

Abstract

In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we observed improvements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.
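
The abstract's idea of turning a detector into a tracker by assigning instance IDs at inference time can be illustrated with a minimal sketch. This is not the authors' implementation: the `Instance` class, the `assign_ids` helper, and the confidence threshold of 0.25 are all hypothetical names and values chosen for illustration. The sketch assumes that instances carried over from a previous frame keep whatever ID they already hold, and that a confident instance without an ID receives a fresh one.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Instance:
    """A hypothetical detection instance propagated across frames."""
    confidence: float
    track_id: Optional[int] = None  # None until the instance becomes a track


def assign_ids(instances: List[Instance],
               id_counter: Iterator[int],
               threshold: float = 0.25) -> List[Instance]:
    """Keep confident instances; give a fresh ID to those that lack one.

    The threshold value is an assumption made for this sketch.
    """
    tracked = []
    for inst in instances:
        if inst.confidence < threshold:
            continue  # too uncertain to output as part of a trajectory
        if inst.track_id is None:
            inst.track_id = next(id_counter)  # start a new trajectory
        tracked.append(inst)
    return tracked


# Frame 1: two of the three instances are confident enough to become tracks.
ids = count()
frame1 = assign_ids([Instance(0.9), Instance(0.1), Instance(0.6)], ids)

# Frame 2: a carried-over instance keeps its ID; a new one gets the next ID.
frame2 = assign_ids([frame1[0], Instance(0.8)], ids)
```

Because the IDs are attached only when results are read out, nothing in this scheme requires a tracking-specific loss or a separate association step, which is consistent with the claim that training is left unchanged.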

Paper Structure (16 sections, 3 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Overview of the Sparse4D framework, which takes multi-view video as input and outputs the perception results for all frames.
  • Figure 2: Inference efficiency (FPS) versus perception performance (mAP) of different algorithms on the nuScenes validation set.
  • Figure 3: Visualizing Attention Weights in Instance Self-Attention: 1) The first row reveals attention weights in vanilla self-attention, where pedestrians in red circles show unintended correlations with the target vehicle (green box). 2) The second row displays attention weights in decoupled attention, effectively addressing the issue.
  • Figure 4: Illustration of Temporal Instance Denoising. (a) During the training phase, instances comprise two components: learnable and noisy. Noisy instances consist of both temporal and non-temporal elements. For noisy instances, we employ a pre-matching approach to allocate positive and negative samples by matching anchors with ground truth, while learnable instances are matched with predictions and ground truth. During the testing phase, only the green blocks in the diagram are retained. (b) An attention mask is employed to prevent feature propagation between groups, where gray indicates no attention between queries and keys, and green denotes the opposite.
  • Figure 5: Architecture of the anchor encoder and attention. We independently conduct high-dimensional feature encoding on multiple components of the anchor and subsequently concatenate them. This approach leads to lower computational and parameter overhead compared to the original Sparse4D. $E$ and $F$ represent anchor embedding and instance feature, respectively.
  • ...and 1 more figure
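
The attention mask described in the Figure 4(b) caption can be sketched as a block-structured boolean matrix. This is a simplified illustration only: it assumes every group attends exclusively to itself, which may be stricter than the pattern actually drawn in the figure, and the helper name `group_attention_mask` is hypothetical.

```python
from typing import List


def group_attention_mask(group_sizes: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True where query q may attend to key k.

    Assumption for this sketch: attention is allowed only inside a group,
    so no features propagate from one group to another.
    """
    # Label each query/key position with the index of the group it belongs to.
    group_of = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    return [[gq == gk for gk in group_of] for gq in group_of]


# Two groups: one with two instances and one with a single instance.
mask = group_attention_mask([2, 1])
for row in mask:
    print(["green" if allowed else "gray" for allowed in row])
```

In the caption's colour scheme, `True` entries correspond to green cells (attention allowed) and `False` entries to gray cells (attention blocked).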