Roadside Monocular 3D Detection Prompted by 2D Detection

Yechi Ma, Yanan Li, Wei Hua, Shu Kong

TL;DR

This work presents Pro3D, a Promptable 3D Detector for roadside monocular 3D detection that uses 2D detections as prompts to lift objects into 3D BEV. A prompt encoder plus an attention-based fusion module integrate 2D cues with 3D backbones, while a scene prior derived from a fixed roadside camera pose provides additional context. The method is detector-agnostic, shows substantial gains over BEVHeight/BEVSpread/BEVDepth on DAIR-V2X-I and Rope3D, and highlights the effectiveness of 2D box coordinates and labels as prompts, stage-wise training benefits, and scene priors. The approach achieves state-of-the-art performance, improves robustness to occlusions and lighting, and offers practical efficiency advantages suitable for real-world deployment.

Abstract

Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird's-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is "easier" to train due to fewer loss terms and performs significantly better at localizing objects with respect to 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when a fixed camera pose or scene geometry provides an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes \{$x$, $y$, width, height, label\} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.
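To make method (c) concrete, the following is a minimal NumPy sketch of encoding 2D detections \{$x$, $y$, width, height, label\} as prompts and attentively fusing them with a 3D backbone feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names (`encode_prompts`, `fuse`), the one-hot label encoding, the single linear projection used as the prompt encoder, the random weights, and the choice of feature-map locations as queries with prompts as keys and values are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encode_prompts(dets, img_w, img_h, num_classes, w_enc):
    """Encode N detections (x, y, w, h, label) into N prompt vectors.

    Assumption: box geometry is normalized by the image size, the label is
    one-hot encoded, and a single linear layer acts as the prompt encoder.
    """
    dets = np.asarray(dets, dtype=np.float64)
    geom = dets[:, :4] / np.array([img_w, img_h, img_w, img_h])
    onehot = np.eye(num_classes)[dets[:, 4].astype(int)]
    raw = np.concatenate([geom, onehot], axis=1)      # (N, 4 + C)
    return raw @ w_enc                                # (N, D)

def fuse(feat_map, prompts, w_q, w_k, w_v):
    """Cross-attention fusion with a residual connection.

    Assumption: every feature-map location is a query that attends over the
    encoded prompts (keys and values).
    """
    h, w, d = feat_map.shape
    flat = feat_map.reshape(h * w, d)
    q, k, v = flat @ w_q, prompts @ w_k, prompts @ w_v
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)     # (H*W, N)
    fused = flat + attn @ v                           # residual add
    return fused.reshape(h, w, d), attn

# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
D, C = 8, 3
dets = [[320, 240, 64, 48, 1], [100, 400, 30, 60, 0]]  # x, y, w, h, label
w_enc = rng.normal(size=(4 + C, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
feat_map = rng.normal(size=(4, 6, D))                  # stand-in for F(X)

prompts = encode_prompts(dets, img_w=1920, img_h=1080, num_classes=C, w_enc=w_enc)
fused, attn = fuse(feat_map, prompts, w_q, w_k, w_v)
```

The fused map keeps the spatial shape of the backbone feature, so in this sketch it could be passed to an unchanged 3D detection head, which is one way to read the claim that the design needs only minimal modifications to existing detectors.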

Paper Structure (20 sections, 1 equation, 9 figures, 14 tables)

Figures (9)

  • Figure 1: A summary of our work. (a) To approach roadside monocular 3D detection, we introduce a novel detector design, Promptable 3D Detector (Pro3D), that exploits 2D detections to prompt the 3D detector for 3D prediction. Pro3D can exploit any 2D detector and 3D detector with minimal modifications. (b) The Pro3D design is motivated by the observation that a simple 2D detector, DINO zhang2022dino, significantly outperforms state-of-the-art roadside monocular 3D detectors (e.g., BEVHeight yang2023BEVHeight and BEVSpread wang2024bevspread) with respect to 2D metrics (on the DAIR-V2X-I dataset). This implies that 2D detection is an "easier" task than monocular 3D detection: with a trained 2D detector, training a 3D detector can be "simplified" to learning to lift 2D detections into 3D space. Moreover, as the roadside camera pose is fixed, we derive a background image as a scene prior; incorporating it remarkably boosts 3D detection performance. (c) As a summary of results, Pro3D significantly outperforms prior works on the DAIR-V2X-I benchmark. Refer to Tables \ref{tab:benchmarking-results(0.5,0.25,0.25)} and \ref{tab:benchmarking-results of Rope3d(0.7,0.5,0.5)} for comprehensive results.
  • Figure 2: We study three methods for encoding and fusing 2D detection prompts. Design (a) concatenates feature maps extracted by the 2D detector and the 3D detector's backbone. Design (b) extracts a feature vector based on a 2D detection's coordinates, encodes it through a prompt encoder, and attentively fuses it with the feature map extracted by the 3D detector's backbone. Design (c) encodes a 2D detection, a 5-dim vector (coordinates $x$ and $y$, object width $w$ and height $h$, and the predicted class label), as the prompt, and attentively fuses the encoded detection with the feature map of the 3D detector's backbone. Somewhat surprisingly, the third design performs the best (Table \ref{tab:results-of-different-prompt-information})!
  • Figure 3: Diagram of our proposed fusion module, which attentively fuses the prompt $p$ (represented by its feature $F_{prompt}(p)$) with the feature map $F(X)$ (from the 3D detector's backbone) and outputs the fused feature $f$.
  • Figure 4: Scene prior generation. We mask out objects of interest in training frames belonging to a specific scene, and average the remaining pixels across frames to obtain an "empty" background, which is our scene prior. A 3D detector that incorporates this scene prior (Fig. \ref{fig:overview}a) achieves remarkable improvements in roadside monocular 3D detection (Table \ref{tab:background}).
  • Figure 5: Visual comparison between the state-of-the-art method BEVHeight yang2023BEVHeight and our Pro3D. Results show that Pro3D can (1) make better orientation predictions than BEVHeight (column-1), and (2) detect objects (missed by BEVHeight) that are too small in size (column-2), heavily occluded (column-3), and in the far field (column-4).
  • ...and 4 more figures
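
The scene prior generation described in the Figure 4 caption (mask out objects of interest, then average the remaining pixels across frames) can be sketched as a masked per-pixel mean. This is a minimal NumPy illustration under assumptions, not the authors' code: the helper names (`boxes_to_mask`, `scene_prior`), the use of axis-aligned boxes (x1, y1, x2, y2) as object masks, and the fallback to a plain mean for pixels that are covered by an object in every frame are all hypothetical choices.

```python
import numpy as np

def boxes_to_mask(shape, boxes):
    """Return a boolean (H, W) mask that is True where any box covers a pixel.

    Assumption: objects are masked with axis-aligned integer boxes
    given as (x1, y1, x2, y2), with x2 and y2 exclusive.
    """
    mask = np.zeros(shape, dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True
    return mask

def scene_prior(frames, boxes_per_frame):
    """Average only the non-object pixels across frames of one scene.

    frames: array-like of shape (T, H, W, 3) from a fixed camera.
    boxes_per_frame: list of T lists of boxes, one list per frame.
    """
    frames = np.asarray(frames, dtype=np.float64)
    _, h, w, _ = frames.shape
    # keep[t, y, x] is True where frame t shows background at (y, x).
    keep = np.stack([~boxes_to_mask((h, w), b) for b in boxes_per_frame])
    total = (frames * keep[..., None]).sum(axis=0)    # (H, W, 3)
    count = keep.sum(axis=0)                          # (H, W)
    masked_mean = total / np.maximum(count, 1)[..., None]
    # Assumption: pixels never seen as background use the plain mean.
    fallback = frames.mean(axis=0)
    return np.where(count[..., None] > 0, masked_mean, fallback)

# Toy scene: gray background (100) with one bright object (255) per frame,
# placed at a different location in each frame.
frames = np.full((2, 4, 4, 3), 100.0)
frames[0, 0:2, 0:2] = 255.0
frames[1, 2:4, 2:4] = 255.0
boxes = [[(0, 0, 2, 2)], [(2, 2, 4, 4)]]

prior = scene_prior(frames, boxes)
naive = frames.mean(axis=0)   # unmasked mean, for comparison
```

In this toy scene the masked mean recovers the uniform background exactly, whereas the unmasked mean keeps ghost traces of the objects, which illustrates why the objects are masked before averaging.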