CARLA Drone: Monocular 3D Object Detection from a Different Perspective

Johannes Meier, Luca Scalerandi, Oussema Dhaouadi, Jacques Kaiser, Nikita Araslanov, Daniel Cremers

TL;DR

GroundMix, a data augmentation pipeline that uses the ground to create 3D-consistent augmentations of a training image, significantly boosts the detection accuracy of a lightweight one-stage detector. Its average precision is on par with or substantially higher than the previous state of the art across all tested datasets.

Abstract

Existing techniques for monocular 3D detection have a serious restriction. They tend to perform well only on a limited set of benchmarks, either on ego-centric car views or on traffic camera views, but rarely on both. To encourage progress, this work advocates for an extended evaluation of 3D detection frameworks across different camera perspectives. We make two key contributions. First, we introduce the CARLA Drone dataset, CDrone. Simulating drone views, it substantially expands the diversity of camera perspectives in existing benchmarks. Despite its synthetic nature, CDrone represents a real-world challenge. To show this, we confirm that previous techniques struggle to perform well both on CDrone and on a real-world 3D drone dataset. Second, we develop an effective data augmentation pipeline called GroundMix. Its distinguishing element is the use of the ground for creating a 3D-consistent augmentation of a training image. GroundMix significantly boosts the detection accuracy of a lightweight one-stage detector. In our expanded evaluation, we achieve average precision on par with or substantially higher than the previous state of the art across all tested datasets.

Paper Structure (30 sections, 5 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Views in comparison: Camera locations for the car view (Waymo), the traffic camera view (Rope3D) and the drone view (CDrone). For each view, we also show the heatmap of 3D object centers after projection to a normalized image plane.
  • Figure 1: The MonoCon architecture extended for GroundMix.
  • Figure 2: Sample images from our novel CDrone dataset with diverse camera views.
  • Figure 2: Soft pasting in comparison to the pasting in Mix-Teaching. (a) The patch to be pasted. (b) Sampling of the mask ($a_1, a_2 \in [0, 10\%\cdot A]$, $b_1, b_2 \in [0, 20\%\cdot B]$). (c) The corresponding mask, where black denotes 0% opacity and white denotes 100% opacity. (d) The result after softly pasting the object with the blending mask (c). (e) The pasting as done in Mix-Teaching. By contrast, our result (d) does not provide visual hints about the 2D bounding box.
  • Figure 3: CDrone statistics in comparison to Rope3D and Waymo. CDrone fills the distribution gap in the depth of the bounding boxes and offers a more uniform distribution of bounding box orientation w.r.t. the ground normal.
  • ...and 5 more figures
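
The soft pasting described in the second Figure 2 caption can be illustrated with a minimal sketch. The caption only gives the sampling ranges for $a_1, a_2, b_1, b_2$, so the rest is assumed here: A is taken to be the patch width and B the patch height, the four values are treated as margin widths on the left, right, top and bottom, and opacity ramps linearly from 0 at the patch border to 1 inside. The function names `sample_soft_mask` and `soft_paste` are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def sample_soft_mask(height, width, rng, max_frac_a=0.10, max_frac_b=0.20):
    """Sample a blending mask in [0, 1] with randomly sized soft borders.

    Assumption (not confirmed by the source): A = patch width, B = patch height,
    and a1, a2, b1, b2 are the left, right, top and bottom margin widths.
    """
    a1, a2 = rng.uniform(0.0, max_frac_a * width, size=2)   # left / right margins
    b1, b2 = rng.uniform(0.0, max_frac_b * height, size=2)  # top / bottom margins

    x = np.arange(width) + 0.5   # pixel-center coordinates along the width
    y = np.arange(height) + 0.5  # pixel-center coordinates along the height

    def ramp(dist, margin):
        # Linear ramp from 0 at the border to 1 at `margin` pixels inside.
        if margin <= 0.0:
            return np.ones_like(dist)
        return np.clip(dist / margin, 0.0, 1.0)

    horizontal = np.minimum(ramp(x, a1), ramp(width - x, a2))
    vertical = np.minimum(ramp(y, b1), ramp(height - y, b2))
    return np.outer(vertical, horizontal)  # shape (height, width)


def soft_paste(image, patch, mask, top, left):
    """Alpha-blend `patch` into a copy of `image` at (top, left) using `mask`."""
    h, w = mask.shape
    out = image.astype(np.float32).copy()
    alpha = mask[..., None]  # broadcast the mask over the color channels
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1.0 - alpha) * region
    return out


rng = np.random.default_rng(0)
mask = sample_soft_mask(40, 60, rng)
image = np.zeros((100, 100, 3))
patch = np.full((40, 60, 3), 255.0)
out = soft_paste(image, patch, mask, top=10, left=20)
```

Because the margins cover at most 10% and 20% of the patch extent in this sketch, the patch center always stays fully opaque while the border fades into the background, which is consistent with the caption's point that the result gives no visual hint about the 2D bounding box.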