Aerial Monocular 3D Object Detection

Yue Hu, Shaoheng Fang, Weidi Xie, Siheng Chen

TL;DR

This work tackles aerial monocular 3D object detection by introducing DVDET, a dual-view system that performs 2D RV and 3D BEV detections from a single aerial image. It couples a trainable geo-deformable transformation with categorical altitude estimation to address severe aerial-view deformation and altitude ambiguity, enabling robust BEV warping. The authors create AM3D-Sim and AM3D-Real datasets and demonstrate that sim-to-real pretraining benefits real-world performance, with DVDET also improving monocular 3D detection in autonomous driving benchmarks. The proposed approach advances 3D scene understanding for drones and offers practical potential for enhanced aerial perception and cross-domain gains to driving scenarios.

Abstract

Drones equipped with cameras can significantly enhance human ability to perceive the world because of their remarkable maneuverability in 3D space. Ironically, object detection for drones has always been conducted in the 2D image space, which fundamentally limits their ability to understand 3D scenes. Furthermore, existing 3D object detection methods developed for autonomous driving cannot be directly applied to drones because they lack deformation modeling, which is essential for the distant aerial perspective with its sensitive distortion and small objects. To fill the gap, this work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space. To address the severe view deformation issue, we propose a novel trainable geo-deformable transformation module that can properly warp information from the drone's perspective to the BEV. Compared to the monocular methods for cars, our transformation includes a learnable deformable network that explicitly corrects the severe deviation. To address the dataset challenge, we propose a new large-scale simulation dataset named AM3D-Sim, generated by the co-simulation of AirSim and CARLA, and a new real-world aerial dataset named AM3D-Real, collected with a DJI Matrice 300 RTK. Both datasets provide high-quality annotations for 3D object detection. Extensive experiments show that i) aerial monocular 3D object detection is feasible; ii) the model pre-trained on the simulation dataset benefits real-world performance; and iii) DVDET also benefits monocular 3D object detection for cars. To encourage more researchers to investigate this area, we will release the dataset and related code at https://github.com/PhyllisH/DVDET.

Paper Structure (18 sections, 9 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Our dual-view object detection system simultaneously detects objects in both the 2D range view (RV) and the 3D bird's-eye view (BEV), given a 2D aerial image. Colors denote BEV detections at various altitudes, where 0 is the horizontal plane and negative values are planes below the horizontal.
  • Figure 2: Directly transforming 2D detections from RV to BEV fails because objects are small and the deformation is severe. Green, blue, and red denote the ground truth, the direct transformation of the 2D detection, and the DVDET detection, respectively. Note that the BEV image is for reference only. Its pixel values are inaccurate due to deformation and loss of altitude information.
  • Figure 3: The overall framework of aerial monocular 3D object detection. First, a backbone extracts the RV feature from the image data. Second, the altitude estimation module predicts the categorical altitude level for each RV feature point; a geometric transformation is then performed to obtain the categorical altitude level for each coordinate in BEV. Third, the RV feature and the estimated altitudes are fed into a geo-deformable transformation module to generate the BEV feature. Finally, the BEV feature is decoded into oriented object bounding boxes.
  • Figure 4: DVDET simultaneously localizes objects in image and 3D space.
  • Figure 5: DVDET is robust and can alleviate the severe deformation issues at high altitude.
  • ...and 3 more figures
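
The role of the altitude estimate described for Figure 3 can be illustrated with plain pinhole geometry. The sketch below is not DVDET's implementation: the function name `pixel_to_bev`, the intrinsics `K`, the camera-to-world rotation `R_c2w`, and the camera pose are all hypothetical values chosen for illustration. It only shows why a pixel needs an assumed altitude plane before it can be placed in BEV: the same pixel lands at different ground coordinates depending on which horizontal plane its viewing ray is intersected with.

```python
import numpy as np

def pixel_to_bev(uv, K, R_c2w, cam_pos, plane_z):
    """Intersect the viewing ray of pixel `uv` with the horizontal plane z = plane_z.

    All arguments are illustrative assumptions, not names from the paper:
    K is a 3x3 pinhole intrinsic matrix, R_c2w rotates camera-frame vectors
    into the world frame, cam_pos is the camera centre in world coordinates.
    """
    u, v = uv
    # Back-project the pixel to a ray direction in the camera frame.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Express the ray direction in the world frame.
    ray_world = R_c2w @ ray_cam
    if abs(ray_world[2]) < 1e-9:
        raise ValueError("viewing ray is parallel to the altitude plane")
    # Solve cam_pos.z + scale * ray_world.z = plane_z for the ray parameter.
    scale = (plane_z - cam_pos[2]) / ray_world[2]
    if scale <= 0:
        raise ValueError("altitude plane lies behind the camera")
    return cam_pos + scale * ray_world

# Hypothetical setup: a camera 100 m above the z = 0 plane, looking straight down.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R_c2w = np.array([[1.0, 0.0, 0.0],
                  [0.0, -1.0, 0.0],
                  [0.0, 0.0, -1.0]])
cam_pos = np.array([0.0, 0.0, 100.0])

on_horizontal = pixel_to_bev((740, 360), K, R_c2w, cam_pos, plane_z=0.0)
below_horizontal = pixel_to_bev((740, 360), K, R_c2w, cam_pos, plane_z=-5.0)
print(on_horizontal, below_horizontal)
```

In this toy setup the same pixel maps to a point 10 m from the camera's nadir on the z = 0 plane and 10.5 m on a plane 5 m lower, so an error in the assumed altitude level shifts the BEV location directly. A fixed geometric warp of this kind has no means of correcting such errors, which is the gap the abstract says the learnable deformable network is meant to address.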