MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models

Johannes Meier, Louis Inchingolo, Oussema Dhaouadi, Yan Xia, Jacques Kaiser, Daniel Cremers

TL;DR

MonoCT addresses domain shift in monocular 3D object detection by introducing a Consistent Teacher framework that self-trains on unlabeled target data. It leverages Generalized Depth Enhancement to produce robust depth estimates, combines multi-source depth cues via kernel density estimation, and uses Pseudo Label Scoring together with Ensemble Merging and Diversity Maximization to curate diverse, high-quality pseudo labels. Across six benchmarks, MonoCT significantly outperforms state-of-the-art domain adaptation methods and generalizes well to car, traffic-camera, and drone viewpoints, while keeping inference-time cost unchanged. The work offers practical improvements for real-world deployment where labeled data in the target domain is scarce or unavailable.

Abstract

We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: MonoCT generates precise pseudo labels to overcome domain shift in monocular 3D object detection, utilizing inner-model and multi-model consistency. These labels are then used to self-supervise the detector. Our approach accommodates car, traffic-camera, and drone views and greatly improves on current SOTA methods.
  • Figure 2: Overview of our MonoCT for pseudo label generation: (1) GDE: We convert auxiliary 2D BBox and 2D corner keypoint predictions into depth estimates, which are then merged into a single depth estimate using a KDE. (2) PLS: To identify the most accurate pseudo labels, we assess the standard deviation of the KDE and evaluate 2D/3D BBox consistency. (3) EM: We enhance predictions further by employing an ensemble of five teacher models. (4) DM: We filter pseudo labels for quality and rotation diversity to optimize self-training.
  • Figure 3: Diversity Maximization (DM): Without DM, pseudo labels are concentrated at $\pm \frac{\pi}{2}$. With DM, high-quality labels are more evenly distributed across all orientations (Lyft $\rightarrow$ KITTI).
  • Figure 4: Top: PLS filters inaccurate pseudo labels better than default scoring. Bottom: KDE-based depth merging is more robust to outliers than weighted box fusion (WBF).
  • Figure 5: Qualitative Results. Lyft $\rightarrow$ KITTI (Column 1): Our pseudo label supervision allows MonoCT to predict depth more accurately than MonoCon, as evident in the BEV visualization. Rope3D and CDrone (Columns 2 and 3): We compare our baseline MonoCon (row 1) with MonoCT (row 2). MonoCT demonstrates superior detection of occluded and distant objects in new environments.
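
The captions for Figures 2 and 4 describe merging several depth estimates with a KDE and using its standard deviation as a quality signal. The following is a minimal sketch of that idea, not the authors' implementation. The function name `merge_depths_kde`, the Gaussian kernel, the fixed 0.5 m bandwidth, and the grid search for the density peak are all assumptions made for illustration.

```python
import numpy as np


def merge_depths_kde(depths, bandwidth=0.5, grid_size=512):
    """Merge several depth estimates (in metres) into one value.

    Hypothetical helper: returns the peak of a Gaussian KDE over the
    estimates plus the standard deviation of that KDE.
    """
    depths = np.asarray(depths, dtype=float)
    if depths.size == 0:
        raise ValueError("need at least one depth estimate")

    # Evaluate the density on a grid padded by three bandwidths on each side.
    grid = np.linspace(depths.min() - 3 * bandwidth,
                       depths.max() + 3 * bandwidth,
                       grid_size)

    # Sum of Gaussian kernels, one per estimate. The normalising constant
    # is omitted because only the location of the maximum is needed.
    z = (grid[:, None] - depths[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    merged = float(grid[np.argmax(density)])

    # A Gaussian KDE with equal weights has variance equal to the
    # population variance of the samples plus bandwidth squared.
    spread = float(np.sqrt(depths.var() + bandwidth ** 2))
    return merged, spread


# Four estimates agree on roughly 20 m; one outlier sits at 35 m.
estimates = [20.1, 19.8, 20.3, 20.0, 35.0]
merged_depth, kde_std = merge_depths_kde(estimates)
mean_depth = float(np.mean(estimates))
```

Taking the density peak keeps the merged value near the cluster of agreeing estimates, whereas a plain average is pulled toward the outlier. A large `kde_std` signals disagreement between the cues, which is the kind of evidence a scoring step could use to reject a pseudo label.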