Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

Gyusam Chang, Jiwon Lee, Donghyun Kim, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sujin Jang, Sangpil Kim

TL;DR

This paper proposes Unified Domain Generalization and Adaptation (UDGA), a practical solution that mitigates the drawbacks of typical supervised learning approaches, and presents a Label-Efficient Domain Adaptation approach that handles unfamiliar targets with significantly fewer labels.

Abstract

Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches struggle to adapt satisfactorily to unseen and unlabeled target datasets (i.e., direct transfer) due to the inevitable geometric misalignment between the source and target domains. In practice, we also encounter constraints on the resources available for training models and collecting annotations for the successful deployment of 3D object detectors. In this paper, we propose Unified Domain Generalization and Adaptation (UDGA), a practical solution to mitigate those drawbacks. We first propose a Multi-view Overlap Depth Constraint that leverages the strong association between multiple views, significantly alleviating geometric gaps due to perspective view changes. Then, we present a Label-Efficient Domain Adaptation approach to handle unfamiliar targets with significantly fewer labels (i.e., 1$\%$ and 5$\%$), while preserving well-defined source knowledge for training efficiency. Overall, the UDGA framework enables stable detection performance in both source and target domains, effectively bridging inevitable domain gaps while demanding fewer annotations. We demonstrate the robustness of UDGA on large-scale benchmarks (nuScenes, Lyft, and Waymo), where our framework outperforms the current state-of-the-art methods.
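To make the 1$\%$ and 5$\%$ label budgets concrete, the following is a minimal sketch of how a labeled target subset of that size could be drawn. The abstract does not describe the sampling protocol, so uniform random sampling with a fixed seed is an assumption here, and the function name `sample_label_subset` and the frame IDs are hypothetical.

```python
import random


def sample_label_subset(frame_ids, fraction, seed=0):
    """Pick a reproducible random subset of frames whose labels are kept.

    Assumption: uniform sampling over frames; the paper may split
    differently (for example, per scene or per sequence).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    frame_ids = list(frame_ids)
    if not frame_ids:
        return []
    # Keep at least one labeled frame, even for tiny datasets.
    k = max(1, round(len(frame_ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(frame_ids, k))


# Hypothetical target-domain frame IDs, for illustration only.
target_frames = [f"frame_{i:05d}" for i in range(2000)]
subset_1pct = sample_label_subset(target_frames, 0.01)
subset_5pct = sample_label_subset(target_frames, 0.05)
print(len(subset_1pct), len(subset_5pct))
```

Fixing the seed keeps the labeled subset identical across runs, which makes comparisons between the two label budgets easier to reproduce.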

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Comparison of performance in both source and target domains (Tab. \ref{tab:UDGA2}). Here, "Average" (orange dots) refers to the mean NDS over the source and target domains. We compare against the prior methods CAM-Conv \cite{facil2019cam}, DG-BEV \cite{wang2023towards}, and PD-BEV \cite{lu2023towards}, and include the empirical lower and upper bounds, DT and Oracle. Note that we only use 5$\%$ of the target labels for Domain Adaptation.
  • Figure 2: (a) An illustration of the translation difference between multi-view camera installations. The first (i.e., source) and second (i.e., target) rows are two perspective views of the same scene captured from different installation points. The translation gap between these views is substantial, approximately 30$\%$. (b) The source-trained network shows poor perception capability in the target domain, primarily due to extrinsic shifts. In $\Delta$Height, mAP and NDS drop by up to 67$\%$ compared to the source. Note that we simulate the camera extrinsic shift using CARLA \cite{Dosovitskiy17} (refer to Appendix \ref{apx:dataset} for further details).
  • Figure 3: An overview of our proposed methodologies. Our proposed methods comprise two major parts: (i) Multi-view Overlap Depth Constraint and (ii) Label-Efficient Domain Adaptation (LEDA). In addition, our framework employs two phases (i.e., pre-training, and then fine-tuning). Note that we adopt our proposed depth constraint in both phases, and LEDA only in the fine-tuning phase.
  • Figure 4: Performance relative to training parameters. The Domain Generalization task is represented in blue, while the Domain Adaptation task is divided into two stages: 1$\%$ in gray and 100$\%$ in red.
  • Figure 5: Qualitative depth visualizations of front view lineups in Lyft. The top row illustrates sparse depth ground truths projected from LiDAR point clouds. The middle and bottom rows are the qualitative results of BEVDepth and Ours, respectively. Yellow boxes highlight the improved depth.
  • ...and 2 more figures
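The two summary quantities used in the captions of Figures 1 and 2 can be sketched as follows: the "Average" as the mean NDS over source and target, and the performance drop as a change relative to the source score. Treating the reported drop as a relative (rather than absolute) change is an assumption, the function names are hypothetical, and the input values are placeholders rather than results from the paper.

```python
def mean_nds(source_nds, target_nds):
    """Mean NDS over the source and target domains (Figure 1 "Average")."""
    return (source_nds + target_nds) / 2.0


def relative_change_pct(source_score, target_score):
    """Percentage change of the target score relative to the source score.

    Assumption: the drop in Figure 2 (b) is measured relative to the
    source-domain score. A negative value means the target is worse.
    """
    if source_score == 0:
        raise ValueError("source_score must be non-zero")
    return 100.0 * (target_score - source_score) / source_score


# Placeholder scores for illustration only, not values from the paper.
source_nds, target_nds = 0.50, 0.25
average = mean_nds(source_nds, target_nds)
drop = relative_change_pct(source_nds, target_nds)
print(f"average NDS: {average:.3f}, relative change: {drop:.1f}%")
```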