
Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

Haoxuanye Ji, Pengpeng Liang, Erkang Cheng

TL;DR

This paper presents QAF2D, a novel query generation approach that infers 3D query anchors from 2D detection results. QAF2D is integrated into three popular query-based 3D object detectors and evaluated comprehensively on the nuScenes dataset.

Abstract

Multi-camera-based 3D object detection has made notable progress in the past several years. However, we observe that there are cases (e.g. faraway regions) in which popular 2D object detectors are more reliable than state-of-the-art 3D detectors. In this paper, to improve the performance of query-based 3D object detectors, we present a novel query generating approach termed QAF2D, which infers 3D query anchors from 2D detection results. A 2D bounding box of an object in an image is lifted to a set of 3D anchors by associating each sampled point within the box with depth, yaw angle, and size candidates. Then, the validity of each 3D anchor is verified by comparing its projection in the image with its corresponding 2D box, and only valid anchors are kept and used to construct queries. The class information of the 2D bounding box associated with each query is also utilized to match the predicted boxes with ground truth for the set-based loss. The image feature extraction backbone is shared between the 3D detector and 2D detector by adding a small number of prompt parameters. We integrate QAF2D into three popular query-based 3D object detectors and carry out comprehensive evaluations on the nuScenes dataset. The largest improvement that QAF2D can bring about on the nuScenes validation subset is $2.3\%$ NDS and $2.7\%$ mAP. Code is available at https://github.com/nullmax-vision/QAF2D.
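
The lifting and validity check described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a camera frame with x right, y down, and z forward, a regular grid of sampled points, and an IoU threshold of 0.5. The function names (`iou_2d`, `project_anchor`, `lift_box_to_anchors`) and all parameter values are hypothetical.

```python
import math
from itertools import product


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def project_anchor(center, size, yaw, K):
    """Project the 8 corners of a 3D box and return the enclosing 2D rectangle.

    Assumed convention: length along local x, height along y, width along
    local z, yaw is a rotation about the vertical (y) axis.
    Returns None if any corner lies behind or too close to the camera.
    """
    fx, fy, cx, cy = K
    l, w, h = size
    c, s = math.cos(yaw), math.sin(yaw)
    us, vs = [], []
    for dx, dy, dz in product((-l / 2, l / 2), (-h / 2, h / 2), (-w / 2, w / 2)):
        x = center[0] + c * dx + s * dz
        y = center[1] + dy
        z = center[2] - s * dx + c * dz
        if z <= 0.1:
            return None
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    return (min(us), min(vs), max(us), max(vs))


def lift_box_to_anchors(box2d, K, depths, yaws, sizes, grid=3, iou_thr=0.5):
    """Lift one 2D box to 3D anchors and keep only those passing the IoU check."""
    fx, fy, cx, cy = K
    x1, y1, x2, y2 = box2d
    # Sample a grid x grid set of points inside the 2D box (hypothetical scheme).
    points = [
        (x1 + (i + 0.5) * (x2 - x1) / grid, y1 + (j + 0.5) * (y2 - y1) / grid)
        for i in range(grid)
        for j in range(grid)
    ]
    anchors = []
    for (u, v), d, yaw, size in product(points, depths, yaws, sizes):
        # Back-project the sampled pixel at depth d to a 3D center candidate.
        center = ((u - cx) * d / fx, (v - cy) * d / fy, d)
        rect = project_anchor(center, size, yaw, K)
        # Keep the anchor only if its projection agrees with the 2D box.
        if rect is not None and iou_2d(rect, box2d) >= iou_thr:
            anchors.append((center, size, yaw))
    return anchors
```

In this sketch, a depth candidate far from the true depth yields a projected rectangle much smaller or larger than the 2D box, so its IoU falls below the threshold and the anchor is discarded. Only the surviving anchors would be used to construct queries.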

Paper Structure (19 sections, 5 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparison between randomly generated anchors and anchors generated by our QAF2D, together with their corresponding detection results. We use StreamPETR (ICCV 2023) as the baseline. For illustration purposes, only a subset of the anchors is drawn to reduce clutter.
  • Figure 2: Overview of the 3D detection pipeline with our proposed 3D query anchor generation approach. The image backbone network extracts features of the input multi-view images, and the features are shared between the 3D detector and 2D detector with visual prompts. 2D detection results are used to generate 3D query anchors. Our 3D anchor generation method first generates box size candidates, yaw angle candidates, and 3D center point candidates, and then combines them to construct an initial set of anchors, which is refined with IoU check to form the final set of 3D query anchors. The entire network is optimized in two stages.
  • Figure 3: Visualization results of StreamPETR-8DQuery and StreamPETR-QAF2D. The results in multi-camera images are shown on the left, and the corresponding results in bird's-eye-view are shown on the right. Three typical cases where StreamPETR-8DQuery fails but its QAF2D-enhanced version succeeds are marked with numbered purple ellipses.
  • Figure 4: Visualization results of BEVFormer-small-DAB3D and BEVFormer-small-QAF2D. The results in multi-camera images are shown on the left, and the corresponding results in bird’s-eye-view are shown on the right.
  • Figure 5: Visualization results of SparseBEV and SparseBEV-QAF2D. The results in multi-camera images are shown on the left, and the corresponding results in bird’s-eye-view are shown on the right.