MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model

Jihyeok Kim, Seongwoo Moon, Sungwon Nah, David Hyunchul Shim

TL;DR

Monocular 3D object detection must recover depth from a single image, which makes depth estimation its main challenge. The authors propose MonoDINO-DETR, a one-stage detector that uses a vision foundation model (DINOv2) as its backbone, adds a Hierarchical Feature Fusion Block and 6D Dynamic Anchor Boxes, and improves depth estimation through transfer learning from Depth Anything V2. The approach achieves state-of-the-art performance on KITTI and generalizes well to a high-elevation racing dataset, all without relying on LiDAR data. The work shows that large pre-trained vision models can be combined with DETR-based detection for robust monocular 3D perception in diverse environments. The authors identify extending 3D bounding box estimation to vehicle roll and pitch angles as future work, which would broaden applicability to autonomous racing and urban scenarios.

Abstract

This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer's decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at https://github.com/JihyeokKim/MonoDINO-DETR.

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the racing track environment and a comparison of detection results between the proposed model and the state-of-the-art model.
  • Figure 2: Overall Structure of MonoDINO-DETR. The proposed method, MonoDINO-DETR, is composed of four main components: the Feature Extraction Module, the Object-Wise Supervision Module, the Depth-Aware Transformer, and the MLP-Based Detection Heads. The visual feature extraction process is represented in green, while the depth feature extraction process is represented in blue.
  • Figure 3: Overall Structure of Feature Extraction Module. The Feature Extraction Module is divided into three components: the DINOv2 backbone, the Visual Feature Extraction Module, and the Depth Feature Extraction Module. The Hierarchical Feature Fusion Block serves as the key module for visual features, while the combination of the DPT Head and DINOv2, which together form the Depth Anything V2 architecture, serves as the key module for depth features.
  • Figure 4: Representation of 6D Dynamic Anchor Boxes. 6D DAB extends 4D DAB by refining the reference point $(x, y)$ and the distances to the left, right, top, and bottom edges $(l, r, t, b)$ at each layer. This iterative refinement improves the model's adaptability to asymmetric object shapes.
  • Figure 5: Indy Race Car Platform. Synchronized camera image data and LiDAR data were acquired during the race using a front-mounted Luminar Iris LiDAR sensor and a front-left Mako G-319C camera.
  • ...and 3 more figures
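
The 6D anchor parameterization described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the names `Anchor6D`, `from_4d`, `to_corners`, and `refine` are hypothetical, image coordinates are assumed to have y increasing downward, and the plain additive update stands in for whatever per-layer refinement rule the paper actually uses.

```python
from typing import NamedTuple, Sequence, Tuple


class Anchor6D(NamedTuple):
    """Hypothetical 6D anchor: reference point plus distances to each box edge."""
    x: float  # reference point, horizontal
    y: float  # reference point, vertical (assumed to grow downward)
    l: float  # distance from reference point to left edge
    r: float  # distance to right edge
    t: float  # distance to top edge
    b: float  # distance to bottom edge


def from_4d(cx: float, cy: float, w: float, h: float) -> Anchor6D:
    """Embed a 4D anchor (center, width, height) as a symmetric 6D anchor."""
    return Anchor6D(cx, cy, w / 2, w / 2, h / 2, h / 2)


def to_corners(a: Anchor6D) -> Tuple[float, float, float, float]:
    """Convert a 6D anchor to (x_min, y_min, x_max, y_max)."""
    return (a.x - a.l, a.y - a.t, a.x + a.r, a.y + a.b)


def refine(a: Anchor6D, delta: Sequence[float]) -> Anchor6D:
    """Apply one refinement step; simple addition is an assumption made here."""
    return Anchor6D(*(v + d for v, d in zip(a, delta)))


# A symmetric anchor is equivalent to an ordinary center/size box.
anchor = from_4d(100.0, 50.0, 40.0, 20.0)
symmetric_box = to_corners(anchor)

# One refinement step moves the left and right edges independently,
# so the reference point no longer sits at the box center.
refined = refine(anchor, (0.0, 0.0, -5.0, 10.0, 0.0, 0.0))
asymmetric_box = to_corners(refined)
```

With four independent edge distances, the reference point can stay fixed while the box grows unevenly around it, which a 4D (center, width, height) anchor cannot express without also moving its center.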