Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception

Chunrui Han, Jinrong Yang, Jianjian Sun, Zheng Ge, Runpei Dong, Hongyu Zhou, Weixin Mao, Yuang Peng, Xiangyu Zhang

TL;DR

This paper tackles the challenge of leveraging long-term temporal context in camera-based multi-view BEV 3D perception. It introduces VideoBEV, a simple yet effective recurrent fusion framework built on LSS-based detectors that maintains a single long-term BEV memory and updates it frame by frame, so the temporal horizon can grow without the compute and memory overhead that parallel fusion incurs as its window widens. A temporal embedding module keeps motion estimates stable when frames are occasionally missed, which benefits velocity estimation and motion prediction. On nuScenes, VideoBEV achieves strong results on 3D object detection, map segmentation, tracking, and motion prediction, demonstrating that long-term temporal information can be exploited efficiently with a recurrent design.

Abstract

Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse temporal information in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already combines the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
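
The trade-off described in the abstract can be illustrated with a toy sketch. It is not the VideoBEV implementation: BEV features are flat lists of floats, ego-motion alignment is omitted, and the fusion operators (a plain average for the parallel case, an exponential moving average with a hypothetical `alpha` for the recurrent case) are stand-ins for learned modules. The only point is the storage pattern: a parallel scheme must keep every frame in its window, while a recurrent scheme keeps one memory regardless of how long the sequence is.

```python
from collections import deque


def parallel_fusion(frames, window):
    """Parallel style: fuse each frame with up to `window - 1` previous frames.

    Returns the fused feature per step and the peak number of stored floats.
    Averaging is a stand-in for a learned fusion network (assumption).
    """
    buffer = deque(maxlen=window)  # must hold up to `window` BEV features
    fused, peak = [], 0
    for feat in frames:
        buffer.append(feat)
        n = len(buffer)
        fused.append([sum(vals) / n for vals in zip(*buffer)])
        peak = max(peak, sum(len(f) for f in buffer))
    return fused, peak


def recurrent_fusion(frames, alpha=0.5):
    """Recurrent style: keep one long-term memory and update it every frame.

    The moving average is a stand-in for a learned fusion module, and
    `alpha` is a hypothetical mixing weight (assumption).
    """
    memory = None
    fused, peak = [], 0
    for feat in frames:
        if memory is None:
            memory = list(feat)  # first frame initialises the memory
        else:
            memory = [alpha * m + (1.0 - alpha) * x for m, x in zip(memory, feat)]
        fused.append(memory)
        peak = max(peak, len(memory))  # storage never grows with history
    return fused, peak


# Eight toy frames, each a 4-element "BEV feature".
frames = [[float(t)] * 4 for t in range(8)]
_, parallel_peak = parallel_fusion(frames, window=8)
_, recurrent_peak = recurrent_fusion(frames)
print(parallel_peak, recurrent_peak)
```

In this toy setting the parallel buffer grows linearly with the window size, whereas the recurrent memory stays at the size of a single feature while still carrying information from all earlier frames.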

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Conceptual comparison of two mainstream temporal feature fusion mechanisms. (a) Parallel temporal propagation within fixed temporal segments of each time stamp [VideoCNN14, TwoStreamCNNVideo14, FaF18, BEVDet4D22, SOLOFusion23, FastBEV23, BEVFormerV2]. (b) Recurrent temporal fusion with an iteratively updated long-term memory within a video sequence of any length [LSTM97, NeuralMachineTranslation14, Seq2SeqRNN14, BEVFormer22]. (c) Efficiency comparison between our recurrent-style VideoBEV and the parallel-style SOLOFusion [SOLOFusion23]. (d) Comparison of the benefits ($\Delta$mAP$\uparrow$ and $\Delta$NDS$\uparrow$) from long-term fusion between the earlier recurrent-style BEVFormer [BEVFormer22] and our VideoBEV. The BEVFormer numbers are taken from [BEVFormer22].
  • Figure 2: Overview of VideoBEV. The backbone first extracts image features from the different views of a frame, which are then transformed from the image view to BEV to obtain the BEV feature. The recurrent fusion module fuses the new BEV feature with the long-term memory; the fused result is used both to update the memory and to perform the 3D perception tasks.
  • Figure 3: Average velocity error (AVE$\downarrow$) versus frame missing rate (FMR). Without the proposed temporal embedding, the AVE is dramatically higher when frames are missed; the temporal embedding substantially mitigates this issue.
  • Figure 4: Visualization results of VideoBEV on the nuScenes val set. We show the predicted 3D boxes of the single-frame baseline and of VideoBEV, both with a ResNet-50 backbone, in multi-camera images and in bird's-eye view. Baseline errors that VideoBEV fixes, namely false negatives, incorrect object orientations, and inaccurate identification of occluded objects, are highlighted with dashed circles in green, purple, and blue, respectively.
  • Figure 5: Efficiency comparison of two temporal feature fusion modules, comparing VideoBEV with the parallel-style SOLOFusion [SOLOFusion23]: (a) number of network parameters in the fusion module; (b) memory cost of the fusion modules during inference.
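
The sensitivity of AVE to missed frames (Figure 3) has a simple arithmetic intuition, sketched below. This is an illustration only, not the paper's temporal embedding: the exact design of that module is not given in this section, so the sinusoidal encoding, the function names, and the parameters `dim` and `max_period` are all hypothetical. The example assumes a nominal frame interval of 0.5 s (nuScenes keyframes are sampled at 2 Hz) and an object moving at a constant 10 m/s.

```python
import math


def velocity(displacement_m, dt_s):
    """Velocity estimate from a displacement and the assumed time gap."""
    return displacement_m / dt_s


def time_gap_embedding(dt_s, dim=8, max_period=10.0):
    """Sinusoidal encoding of the time gap between memory and current frame.

    Hypothetical stand-in for a temporal embedding: it gives the fusion
    module a signal that changes when frames are dropped.
    """
    half = dim // 2
    emb = []
    for i in range(half):
        freq = 1.0 / (max_period ** (i / half))
        emb.append(math.sin(dt_s * freq))
        emb.append(math.cos(dt_s * freq))
    return emb


nominal_dt = 0.5   # seconds between consecutive frames (assumed)
actual_dt = 1.0    # one frame was missed, so the real gap doubles
displacement = 10.0 * actual_dt  # metres travelled at 10 m/s

# A model that implicitly assumes the nominal gap overestimates velocity.
print(velocity(displacement, nominal_dt))  # 20.0
# Knowing the real gap recovers the true velocity.
print(velocity(displacement, actual_dt))   # 10.0
# The embedding differs between the two gaps, so the gap is observable.
print(time_gap_embedding(nominal_dt) != time_gap_embedding(actual_dt))  # True
```

Without any encoding of elapsed time, the same observed displacement is consistent with several velocities, which is one plausible reading of why the velocity error grows with the frame missing rate.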