MonoDETRNext: Next-Generation Accurate and Efficient Monocular 3D Object Detector

Pan Liao, Feng Yang, Di Wu, Wenhui Zhao, Jinwen Yu

TL;DR

MonoDETRNext is a monocular 3D object detector offered in two variants that differ in their depth estimator: MonoDETRNext-E, which prioritizes speed, and MonoDETRNext-A, which focuses on accuracy. The authors posit that it establishes a new benchmark in monocular 3D object detection and opens avenues for future research.

Abstract

Monocular 3D object detection has vast application potential across various fields. DETR-type models have shown remarkable performance in different areas, but there is still considerable room for improvement in monocular 3D detection, especially with the existing DETR-based method, MonoDETR. After addressing the query initialization issues in MonoDETR, we explored several performance enhancement strategies, such as incorporating a more efficient encoder and utilizing a more powerful depth estimator. Ultimately, we proposed MonoDETRNext, a model that comes in two variants based on the choice of depth estimator: MonoDETRNext-E, which prioritizes speed, and MonoDETRNext-A, which focuses on accuracy. We posit that MonoDETRNext establishes a new benchmark in monocular 3D object detection and opens avenues for future research. We conducted an exhaustive evaluation demonstrating the model's superior performance against existing solutions. Notably, MonoDETRNext-A demonstrated a 3.52$\%$ improvement in the $AP_{3D}$ metric on the KITTI test benchmark over MonoDETR, while MonoDETRNext-E showed a 2.35$\%$ increase. Additionally, the computational efficiency of MonoDETRNext-E slightly exceeds that of its predecessor.

Paper Structure (23 sections, 8 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Comparison of DETR-type 3D detection models, with different colors representing distinct functional modules.
  • Figure 2: The schematic depiction of MonoDETRNext. The distinction between MonoDETRNext-A and MonoDETRNext-E primarily resides in their respective depth prediction mechanisms. The provided illustration delineates the intricate depth prediction scheme adopted by MonoDETRNext-A, whereas the depth predictor employed in MonoDETRNext-E remains congruent with that of MonoDETR.
  • Figure 3: Detailed differences between our decoder and MonoDETR's decoder. The main differences are the source of the embedding and the presence or absence of initial anchors.
  • Figure 4: The fusion block within CFIM is depicted, showcasing the architectures of the Sequential Dilated Convolution (SDC) module and the Regional-Global Feature Interaction (RGFI) module proposed therein.
  • Figure 5: The visualization results of the final attention output from the decoder of MonoDETR and MonoDETRNext. Hotter colors indicate higher attention weights. It is evident that the heat map of MonoDETRNext is more concentrated than that of MonoDETR.