Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection

Seokha Moon, Hongbeen Park, Jungphil Kwon, Jaekoo Lee, Jinkyu Kim

TL;DR

This work tackles multi-camera 3D object detection by leveraging temporal context through a predictive learning paradigm. It introduces DAP, a two-branch architecture with a Prediction Encoder that forecasts current object poses from past BEV features and a Context Fused Detection module that integrates this predictive information into detection, using a Fusion DeFormable Attention mechanism. On nuScenes, integrating predictive cues improves BEVDet4D and BEVDepth baselines in NDS and mAP, demonstrating enhanced temporal cue utilization, especially for occluded or moving objects. The approach is plug-and-play with existing BEV-based detectors and shows practical potential for robust, motion-aware 3D detection in autonomous driving and robotics.

Abstract

In autonomous driving and robotics, there is growing interest in utilizing short-term historical data to enhance multi-camera 3D object detection, leveraging the continuous and correlated nature of input video streams. Recent work has focused on spatially aligning BEV-based features over timesteps. However, this is often limited, as its gain does not scale well with long-term past observations. To address this, we advocate for supervising a model to predict objects' poses given past observations, thus explicitly guiding it to learn objects' temporal cues. To this end, we propose a model called DAP (Detection After Prediction), consisting of a two-branch network: (i) a branch responsible for forecasting the current objects' poses given past observations and (ii) another branch that detects objects based on the current and past observations. The features from branch (i) that predict the current objects are fused into branch (ii) to transfer predictive knowledge. We conduct extensive experiments on the large-scale nuScenes dataset, and we observe that utilizing such predictive information significantly improves the overall detection performance. Our model can be used plug-and-play, showing consistent performance gains.
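
The two-branch idea described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the class name `TwoBranchSketch`, the 1x1 convolutions, the single-channel heatmap standing in for object poses, and the concatenation-based fusion (used here in place of the Fusion DeFormable Attention mentioned in the TL;DR) are all assumptions made for illustration. Ego-motion alignment of past BEV features is also omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1x1(x, w):
    """Apply a 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TwoBranchSketch:
    """Hypothetical stand-in for a predict-then-detect network."""

    def __init__(self, channels, num_past):
        c, scale = channels, 0.1
        # Branch (i): sees only past BEV features.
        self.w_pred = rng.normal(0, scale, (c, c * num_past))
        self.w_pred_head = rng.normal(0, scale, (1, c))
        # Branch (ii): sees current and past BEV features.
        self.w_det = rng.normal(0, scale, (c, c * (num_past + 1)))
        # Simplified fusion of detection and predictive features
        # (assumption: concatenation + 1x1 conv, not deformable attention).
        self.w_fuse = rng.normal(0, scale, (c, 2 * c))
        self.w_det_head = rng.normal(0, scale, (1, c))

    def forward(self, bev_past, bev_now):
        # bev_past: (T, C, H, W); bev_now: (C, H, W)
        t, c, h, w = bev_past.shape
        past_stacked = bev_past.reshape(t * c, h, w)

        # Branch (i): forecast the current object heatmap from the past only.
        pred_feat = np.maximum(conv1x1(past_stacked, self.w_pred), 0.0)
        pred_heatmap = sigmoid(conv1x1(pred_feat, self.w_pred_head))

        # Branch (ii): detect from current + past, then fuse in pred_feat.
        all_stacked = np.concatenate([past_stacked, bev_now], axis=0)
        det_feat = np.maximum(conv1x1(all_stacked, self.w_det), 0.0)
        fused_in = np.concatenate([det_feat, pred_feat], axis=0)
        fused = np.maximum(conv1x1(fused_in, self.w_fuse), 0.0)
        det_heatmap = sigmoid(conv1x1(fused, self.w_det_head))
        return pred_heatmap, det_heatmap


def bce(p, y, eps=1e-7):
    """Mean binary cross-entropy between heatmap p and target y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


model = TwoBranchSketch(channels=8, num_past=2)
bev_past = rng.normal(size=(2, 8, 16, 16))
bev_now = rng.normal(size=(8, 16, 16))
target = np.zeros((1, 16, 16))
target[0, 4, 5] = 1.0  # one object centre in the current frame

pred_hm, det_hm = model.forward(bev_past, bev_now)
# Both branches are supervised with the *current* ground truth.
loss = bce(pred_hm, target) + bce(det_hm, target)
```

The point of the sketch is the supervision pattern: the prediction branch never sees `bev_now`, yet it is trained against the current ground truth, so its features must encode object motion before they are fused into the detection branch.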

Paper Structure (12 sections, 5 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Unlike prior multi-camera 3D object detection approaches, which use the current and past observations to detect objects, our proposed method regularizes the model by predicting objects' current poses from past observations. This predictive knowledge is then fused into the object detector, enhancing overall object detection performance.
  • Figure 2: Our proposed multi-view 3D object detection architecture. Built upon a conventional BEV (Bird's Eye View)-based multi-view object detection model, our model consists of two main modules: (i) the Temporal Context Extraction Module, which predicts objects' current poses conditioned on past BEV-based observations, and (ii) the Context Fused Detection Module, which detects 3D objects in the scene based on the current and past BEV-based observations. The intermediate BEV feature from the Temporal Context Extraction Module is fused into the Context Fused Detection Module through the Fusion Encoder for the final verdict.
  • Figure 3: Examples of detected objects (visualization on LiDAR and LiDAR-Image). (A) shows results from the base model, ours, and prediction (i.e., poses of predicted objects given only past observations). (B) provides a clearer view of the results from the base model and ours through visualization on LiDAR and image. Boxes detected by the base model and by our model are marked in green and purple, respectively. Ground truth boxes are marked in blue.
  • Figure 4: Visualization of two scenarios: one with occlusion occurring over time (A) and the other with an object making a left turn (B). The results from both scenarios demonstrate the contribution of predictions to object detection.