MDHA: Multi-Scale Deformable Transformer with Hybrid Anchors for Multi-View 3D Object Detection

Michelle Adeline, Junn Yong Loo, Vishnu Monn Baskaran

Abstract

Multi-view 3D object detection is a crucial component of autonomous driving systems. Contemporary query-based methods either depend on dataset-specific initialization of 3D anchors, which introduces bias, or rely on dense attention mechanisms, which are computationally inefficient and do not scale. To overcome these issues, we present MDHA, a novel sparse query-based framework that constructs adaptive 3D output proposals using hybrid anchors derived from multi-view, multi-scale image input. Fixed 2D anchors are combined with depth predictions to form 2.5D anchors, which are then projected to obtain 3D proposals. To ensure high efficiency, our proposed Anchor Encoder performs sparse refinement and selects the top-$k$ anchors and features. Moreover, while existing multi-view attention mechanisms rely on projecting reference points onto multiple images, our novel Circular Deformable Attention mechanism projects onto a single image while allowing reference points to seamlessly attend to adjacent images, improving efficiency without compromising performance. On the nuScenes val set, MDHA achieves 46.4% mAP and 55.0% NDS with a ResNet101 backbone, significantly outperforming a baseline in which anchor proposals are modelled as learnable embeddings. Code is available at https://github.com/NaomiEX/MDHA.
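
To make the hybrid-anchor idea concrete, below is a minimal sketch of lifting a 2.5D anchor (a fixed 2D pixel center paired with a predicted depth) into a 3D proposal via a standard pinhole camera model. The function name, tensor shapes, and calling convention are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def lift_25d_to_3d(centers_2d, depth, intrinsics, cam_to_ego):
    """Project 2.5D anchors (u, v, d) into 3D proposals in the ego frame.

    centers_2d: (N, 2) fixed per-token pixel centers (u, v)
    depth:      (N, 1) per-token depth predicted by a depth head
    intrinsics: (3, 3) camera intrinsic matrix K
    cam_to_ego: (4, 4) camera-to-ego extrinsic transform
    """
    N = centers_2d.shape[0]
    # Homogeneous pixel coordinates scaled by depth: d * [u, v, 1]^T
    uv1 = torch.cat([centers_2d, torch.ones(N, 1)], dim=-1)          # (N, 3)
    # Back-project through the inverse intrinsics into the camera frame
    cam_pts = (torch.linalg.inv(intrinsics) @ (uv1 * depth).T).T     # (N, 3)
    # Lift to homogeneous coordinates and apply the extrinsic transform
    cam_pts_h = torch.cat([cam_pts, torch.ones(N, 1)], dim=-1)       # (N, 4)
    ego_pts = (cam_to_ego @ cam_pts_h.T).T[:, :3]                    # (N, 3)
    return ego_pts
```

In MDHA the 2D centers are fixed per feature-map location and only the depth is predicted, so the projection itself is plain pinhole geometry as sketched above.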


Paper Structure

This paper contains 20 sections, 15 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison between PETR and Sparse4D models with our proposed architecture. (a) PETR models encode 3D information via positional embeddings and refine queries using dense attention within their decoder. (b) Sparse4D models initialize 3D anchors from k-means clustering on nuScenes and iteratively refine both queries and anchors using sparse attention within the decoder. (c) MDHA employs image tokens as queries; the $(x,y)$ center coordinates of each token, paired with depth predictions, form 2.5D anchors, which are projected into 3D anchors. This eliminates the need for good anchor initialization. Anchors and queries undergo iterative refinement using multi-view-spanning sparse attention within the decoder.
  • Figure 2: Our proposed MDHA architecture. Multi-view images are fed into the backbone and FPN neck to extract multi-scale, multi-view image features. These feature tokens serve as input to the DepthNet, which pairs their 2D center coordinates with predicted depths to form 2.5D anchors; these are projected to obtain 3D output proposals. The tokens and proposals undergo refinement in the 1-layer MDHA Anchor Encoder, which also selects the top-$k$ queries and proposals for further refinement in the MDHA Decoder, which additionally incorporates temporal information via the memory queue (a sketch of the top-$k$ selection appears after this list).
  • Figure 3: Depth distribution of 3D object centers projected onto all 6 cameras in the nuScenes train set. Brighter colours indicate objects that are farther away. The semicircular voids at the bottom of the front and back camera views correspond to the protruding front and rear of the ego vehicle, where no objects can be located.
  • Figure 4: CDA on horizontally concatenated input. The circles denote reference points and the arrows represent sampling locations (a toy sketch of this wrap-around sampling appears after this list).
  • Figure 5: Qualitative comparison between 3D proposals obtained from the learnable anchors setting (top) and our MDHA Anchor Encoder (bottom). We visualize selected proposals on all 6 cameras (left) and all proposals in bird's-eye-view (right). For visual clarity, we display non-overlapping proposals for learnable anchors, and the top-20 proposals based on classification score for MDHA.
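
To illustrate the encoder's top-$k$ selection described in Figure 2, here is a minimal sketch. The helper name, the value of $k$, and all tensor shapes are assumptions for illustration, not the paper's configuration.

```python
import torch

def select_topk(scores, queries, proposals, k=900):
    """Keep the k highest-scoring tokens and their 3D proposals for the decoder.

    scores:    (N,) per-token classification confidence from the encoder head
    queries:   (N, C) refined encoder feature tokens
    proposals: (N, 3) 3D anchor proposals (e.g. object centers)
    """
    # torch.topk returns the k largest scores and their indices
    _, idx = scores.topk(k)
    return queries[idx], proposals[idx]

# Example: 10,000 tokens across all views and scales, keep the best 900
scores = torch.rand(10_000)
queries, proposals = select_topk(scores, torch.randn(10_000, 256), torch.randn(10_000, 3))
```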
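
The wrap-around behaviour in Figure 4 can be illustrated with a toy sampler over horizontally concatenated multi-view features: a sampling x-coordinate that spills past one image border wraps circularly onto the adjacent camera's features. This is only a sketch of the wrapping idea, not the paper's CDA module, which builds on multi-scale deformable attention; names and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def circular_sample(concat_feats, sampling_locs):
    """Sample from horizontally concatenated multi-view features with
    circular wrap-around, so sampling points past an image border land
    on the adjacent camera's features.

    concat_feats:  (1, C, H, W_total) all camera features side by side
    sampling_locs: (1, P, 2) normalized (x, y) on the concatenated canvas;
                   x may leave [0, 1] after adding deformable offsets
    """
    x = sampling_locs[..., 0] % 1.0            # wrap horizontally (circular)
    y = sampling_locs[..., 1].clamp(0.0, 1.0)  # clamp vertically (no wrap)
    # grid_sample expects coordinates in [-1, 1]
    grid = torch.stack([x, y], dim=-1) * 2.0 - 1.0      # (1, P, 2)
    return F.grid_sample(concat_feats, grid.unsqueeze(2),  # -> (1, C, P, 1)
                         align_corners=False).squeeze(-1)  # (1, C, P)
```

Because the six surround-view cameras cover a full 360°, the modulo on the x-coordinate means a reference point near the right edge of the last image can attend seamlessly into the first image, which is the behaviour the arrows in Figure 4 depict.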