
Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, Zhizhong Su

TL;DR

Sparse4Dv2 tackles the inefficiency of temporal fusion in sparse-based 3D perception by introducing a recurrent instance-propagation mechanism that decouples image features from anchor states. The approach leverages Efficient Deformable Aggregation, camera-parameter encoding, and dense-depth supervision within an encoder–decoder framework to enable frame-by-frame sparse feature fusion with constant-time temporal complexity. Empirical results on nuScenes show state-of-the-art NDS and mAP among sparse methods and competitive speed, outperforming several BEV- and sparse-query baselines. This work offers a practical, scalable pathway for long-term temporal perception in autonomous systems, with potential extensions to HD maps, trajectory prediction, and end-to-end planning.

Abstract

Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at https://github.com/linxuewu/Sparse4D.
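The complexity claim in the abstract can be made concrete with a toy counter. The sketch below is not the authors' implementation: `CountingSampler`, the averaging in `multi_frame_fusion`, and the `momentum` blend in `recurrent_fusion` are hypothetical stand-ins for image feature sampling and learned fusion. It only shows that sampling a window of T frames costs T sampling passes per anchor, while carrying a state forward costs one pass regardless of T.

```python
from collections import deque


class CountingSampler:
    """Toy stand-in for image feature sampling that counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, anchor, frame):
        self.calls += 1
        # Hypothetical "feature": a scalar mixing the anchor with the frame.
        return anchor * frame


def multi_frame_fusion(anchors, history, sampler):
    """Sample every frame in the history window for each anchor, then average."""
    return [
        sum(sampler(a, f) for f in history) / len(history)
        for a in anchors
    ]


def recurrent_fusion(anchors, frame, state, sampler, momentum=0.5):
    """Sample only the current frame and blend it with the carried state."""
    fused = []
    for a, s in zip(anchors, state):
        fresh = sampler(a, frame)
        # Assumed blending rule; the real model fuses features with learned layers.
        fused.append(momentum * s + (1.0 - momentum) * fresh)
    return fused


def samples_per_frame(window, num_frames=6, num_anchors=4):
    """Return (multi-frame, recurrent) sampler calls made on the last frame."""
    anchors = [float(i + 1) for i in range(num_anchors)]
    frames = [float(t + 1) for t in range(num_frames)]

    multi, recur = CountingSampler(), CountingSampler()
    history = deque(maxlen=window)
    state = [0.0] * num_anchors
    for frame in frames:
        history.append(frame)
        multi.calls = recur.calls = 0  # count work for the current frame only
        multi_frame_fusion(anchors, history, multi)
        state = recurrent_fusion(anchors, frame, state, recur)
    return multi.calls, recur.calls
```

Calling `samples_per_frame(2)` and `samples_per_frame(5)` shows the first count growing with the window length while the second stays at one pass per anchor.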

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of two different temporal fusion approaches. (a) Sparse4D requires projecting the anchors of the current frame onto each historical frame, followed by multi-frame feature sampling and fusion. (b) Sparse4Dv2 achieves fusion through the propagation of instance features.
  • Figure 2: Overall framework of Sparse4Dv2, which conforms to an encoder-decoder structure. The input consists of three components: multi-view images, camera parameters, and instance information from previous frames. The output is the refined instances (anchors and corresponding features), which serve as the perception results for the current frame. Additionally, a subset of these instances is selected and used as input for the next frame.
  • Figure 3: Efficient Deformable Aggregation.
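
The instance hand-off described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `Instance` layout, ranking by a `confidence` score, and the `k` and `total` budgets are all hypothetical choices made here for clarity. The caption only states that a subset of refined instances is selected and passed to the next frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    anchor: List[float]   # structured anchor state (layout is illustrative)
    feature: List[float]  # instance feature vector
    confidence: float     # assumed selection score


def select_for_next_frame(instances: List[Instance], k: int) -> List[Instance]:
    """Keep the k highest-scoring refined instances (assumed selection rule)."""
    return sorted(instances, key=lambda inst: inst.confidence, reverse=True)[:k]


def build_decoder_input(
    propagated: List[Instance], initial: List[Instance], total: int
) -> List[Instance]:
    """Place propagated instances first, then fill the budget with fresh ones."""
    return (propagated + initial)[:total]


def demo() -> List[float]:
    """Run one hand-off step and return the confidences of the next input."""
    refined = [
        Instance([0.0, 0.0, 0.0], [0.1], 0.1),
        Instance([1.0, 0.0, 0.0], [0.2], 0.9),
        Instance([2.0, 0.0, 0.0], [0.3], 0.5),
        Instance([3.0, 0.0, 0.0], [0.4], 0.7),
    ]
    fresh = [Instance([0.0, 0.0, 0.0], [0.0], 0.0) for _ in range(3)]
    carried = select_for_next_frame(refined, k=2)
    next_input = build_decoder_input(carried, fresh, total=4)
    return [inst.confidence for inst in next_input]
```

Because only the selected instances cross the frame boundary, the per-frame input size stays fixed no matter how many frames have been processed.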