Cyclic Refiner: Object-Aware Temporal Representation Learning for Multi-View 3D Detection and Tracking

Mingzhe Guo, Zhipeng Zhang, Liping Jing, Yuan He, Ke Wang, Heng Fan

TL;DR

This paper addresses the vulnerability of temporal fusion in multi-view 3D detection and tracking to distractors from historical frames. It introduces Cyclic Refiner, a backward refinement mechanism that uses posterior object predictions to refine image and BEV features prior to temporal fusion, forming a cycle with forward inference across times $t$ and $t+1$. It couples this with an Object-aware Association that performs Multi-clue Matching across image, BEV, and head embeddings and a Cascaded Scale-aware Matching to handle objects at different scales, delivering a unified framework (CycBEVFormer/CycSparseBEV/CycBEVDet4D) for detection and tracking on nuScenes with consistent gains. The results demonstrate reduced temporal error accumulation and improved robustness in both perception tasks, highlighting practical impact for camera-based autonomous driving systems.

Abstract

We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking tasks. Having observed that the efficacy of the temporal fusion strategy in recent multi-view perception methods may be weakened by distractors and background clutter in historical frames, we propose a cyclic learning mechanism to improve the robustness of multi-view representation learning. The essence is constructing a backward bridge to propagate information from model predictions (e.g., object locations and sizes) to image and BEV features, which forms a cycle with regular forward inference. After backward refinement, the responses of target-irrelevant regions in historical frames are suppressed, decreasing the risk of polluting future frames and improving the object awareness of temporal fusion. We further tailor an object-aware association strategy for tracking based on the cyclic learning model. The cyclic learning model not only provides refined features, but also delivers finer clues (e.g., scale level) for tracklet association. The proposed cyclic learning method and association module together form a novel and unified multi-task framework. Experiments on nuScenes show that the proposed model achieves consistent performance gains over baselines of different designs (i.e., dense query-based BEVFormer, sparse query-based SparseBEV and LSS-based BEVDet4D) on both detection and tracking evaluation.
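The backward bridge described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single-channel BEV feature map stored as nested lists, axis-aligned boxes given in grid cells, per-object weights supplied by hand (in the paper they are predicted by the refiner), and a hypothetical `background` floor value for regions no prediction covers.

```python
from typing import List, Tuple

# Hypothetical box format: (center_x, center_y, width, length, weight),
# all expressed in BEV grid cells. The weight stands in for the per-object
# mask weight that the refiner would predict.
Box = Tuple[float, float, float, float, float]
Grid = List[List[float]]


def build_object_mask(boxes: List[Box], height: int, width: int,
                      background: float = 0.1) -> Grid:
    """Build a mask that is high inside predicted boxes and low elsewhere."""
    mask = [[background] * width for _ in range(height)]
    for cx, cy, w, l, weight in boxes:
        # Clip the box footprint to the grid.
        x0 = max(0, int(cx - w / 2))
        x1 = min(width, int(cx + w / 2) + 1)
        y0 = max(0, int(cy - l / 2))
        y1 = min(height, int(cy + l / 2) + 1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                # Overlapping objects keep the larger weight.
                mask[y][x] = max(mask[y][x], weight)
    return mask


def refine(features: Grid, mask: Grid) -> Grid:
    """Suppress target-irrelevant responses by element-wise masking."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


if __name__ == "__main__":
    feats = [[1.0] * 4 for _ in range(4)]
    mask = build_object_mask([(1.0, 1.0, 1.0, 1.0, 0.9)], height=4, width=4)
    for row in refine(feats, mask):
        print(row)
```

The refined map, rather than the raw one, would then be handed to temporal fusion for the next frame, so that background responses at time $t$ contribute less at time $t+1$.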

Paper Structure

This paper contains 15 sections, 3 equations, 15 figures, 13 tables, 1 algorithm.

Figures (15)

  • Figure 1: (a) Illustration of our cyclic pipeline. After the first forward inference at time $t$, instead of directly propagating the distractor-contaminated features to the next frame through temporal fusion (black arrows), we exploit the posterior predictions as object-aware prior information to refine the previously learned image and BEV features, i.e., "Backward Refinement" (red arrows). The refined features at time $t$ are then forwarded to temporal fusion and the second forward inference at time $t+1$, which forms a cyclic route for object-aware representation learning. (b) Visualization of background clutter pollution in temporal fusion. At time $t$, no detections are predicted over the background clutter (cyan circles). At time $t+1$, the temporal model (BEVFormer-Temporal) mistakenly produces false positives over the background clutter, whereas the static model (BEVFormer-Static) does not, illustrating that semantically rich background in previous frames may corrupt future features after temporal fusion.
  • Figure 2: Illustration of the "backward refinement" in the proposed cyclic refiner. Each predicted object determines its scale level (i.e., small, mid or large) and its weight in the mask by linearly mapping the three concatenated feature embeddings. Applying the masks to the image/BEV features filters out target-irrelevant distractors and benefits representation learning.
  • Figure 3: Illustration of temporal fusion with our object-aware representations. The refined image/BEV features at time $t$ are concatenated with the learned features at time $t+1$ to generate object-aware attention weights and sampling offsets, guiding feature sampling toward target-relevant regions in the deformable attention of Deformable DETR.
  • Figure 4: Architecture of the proposed Object-aware Association (OAA). The Multi-clue Matching first matches new detections to existing tracklets using a weighted sum of three embedding similarity matrices. The 3D boxes of unmatched detections and tracklets are then buffered with the scale level assigned by the cyclic refiner and fed into the Cascading Scale-aware Matching, which performs hierarchical IoU matching from large-scale objects to small-scale ones.
  • Figure 5: Architecture of the proposed object-aware temporal learning framework for both 3D detection and tracking tasks. After the forward inference at time $t$ (black arrows), the "backward refinement" of the cyclic refiner exploits information in the predictions to refine image and BEV features (red arrows). The refined features are then used for temporal fusion at time $t+1$. The proposed object-aware association exploits the refined features and the scale levels predicted by the cyclic refiner to perform object tracking (purple arrows).
  • ...and 10 more figures
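
The two-stage association summarized in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under several assumptions that are not taken from the paper: embeddings are plain Python lists compared with cosine similarity, the clue weights and thresholds are hand-picked, greedy assignment stands in for whatever matcher the authors use, boxes are reduced to axis-aligned 2D rectangles `(x0, y0, x1, y1)`, and "buffering by scale level" is read here as grouping unmatched boxes by their assigned scale. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), simplified 2D box


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fused_similarity(dets: List[Dict[str, Sequence[float]]],
                     trks: List[Dict[str, Sequence[float]]],
                     weights: Dict[str, float]) -> List[List[float]]:
    """Weighted sum of per-clue similarity matrices (e.g. image, BEV, head)."""
    return [[sum(w * cosine(d[clue], t[clue]) for clue, w in weights.items())
             for t in trks] for d in dets]


def greedy_match(score: List[List[float]], threshold: float) -> List[Tuple[int, int]]:
    """Pick the highest-scoring pairs first; each row/column is used once."""
    pairs = sorted(((s, i, j) for i, row in enumerate(score)
                    for j, s in enumerate(row)), reverse=True)
    used_i, used_j, matches = set(), set(), []
    for s, i, j in pairs:
        if s < threshold:
            break
        if i in used_i or j in used_j:
            continue
        matches.append((i, j))
        used_i.add(i)
        used_j.add(j)
    return matches


def iou_2d(a: Rect, b: Rect) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cascaded_scale_matching(dets: List[Tuple[int, Rect, str]],
                            trks: List[Tuple[int, Rect, str]],
                            iou_thr: float) -> List[Tuple[int, int]]:
    """IoU matching of leftover boxes, processed from large to small scale."""
    matches = []
    for scale in ("large", "mid", "small"):
        d = [(i, box) for i, box, s in dets if s == scale]
        t = [(j, box) for j, box, s in trks if s == scale]
        score = [[iou_2d(db, tb) for _, tb in t] for _, db in d]
        for a, b in greedy_match(score, iou_thr):
            matches.append((d[a][0], t[b][0]))
    return matches
```

In use, `fused_similarity` and `greedy_match` would run first on all detections and tracklets, and only the pairs left unmatched would be passed, with their ids and scale labels, to `cascaded_scale_matching`. An optimal assignment solver could replace the greedy step without changing the overall structure.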