GATE3D: Generalized Attention-based Task-synergized Estimation in 3D*

Eunsoo Im, Changhyun Jee, Jung Kwon Lee

TL;DR

GATE3D addresses cross-domain generalization in monocular 3D object detection with a weakly supervised framework built on pseudo-labels and 2D–3D consistency. It introduces three architectural components (Query Gate, Adaptive Fusion, and an attention-based region head) and Virtual Space normalization to handle sensor variability, together with a pseudo-labeling pipeline and regularization to stabilize learning. Evaluations on KITTI and the Synergy3D-derived Office dataset show competitive monocular performance and strong cross-domain generalization, indicating the method’s potential for robotics, AR, and VR applications. The work also reports ablation evidence for its core components, underscoring the practical value of reducing labeling costs while keeping 3D perception robust in varied environments.

Abstract

The emerging trend in computer vision emphasizes developing universal models capable of simultaneously addressing multiple diverse tasks. Such universality typically requires joint training across multi-domain datasets to ensure effective generalization. However, monocular 3D object detection presents unique challenges in multi-domain training due to the scarcity of datasets annotated with accurate 3D ground-truth labels, especially beyond typical road-based autonomous driving contexts. To address this challenge, we introduce a novel weakly supervised framework leveraging pseudo-labels. Current pretrained models often struggle to accurately detect pedestrians in non-road environments due to inherent dataset biases. Unlike generalized image-based 2D object detection models, achieving similar generalization in monocular 3D detection remains largely unexplored. In this paper, we propose GATE3D, a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions. Remarkably, our model achieves competitive performance on the KITTI benchmark as well as on an indoor-office dataset collected by us to evaluate the generalization capabilities of our framework. Our results demonstrate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies, highlighting substantial potential for broader impacts in robotics, augmented reality, and virtual reality applications. Project page: https://ies0411.github.io/GATE3D/

Paper Structure

This paper contains 15 sections, 19 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Qualitative results of Query Diversity
  • Figure 2: GATE3D architecture overview. The proposed framework incorporates a DETR-style 3D detection backbone enhanced with attention-based modules, and supports both fully and weakly supervised learning modes. For ground-truth-labeled samples, the model is trained via standard 3D detection losses. For weakly labeled data, pseudo-3D annotations are generated from 2D detection, monocular depth estimation, and orientation prediction. To mitigate label noise, we introduce a 2D–3D consistency loss that aligns projected 3D box dimensions with frozen 2D predictions. Notably, during weak supervision, the 2D detector remains fixed to preserve its reliability, while only the 3D decoder is optimized. This hybrid learning strategy improves robustness and generalization across diverse domains.
  • Figure 3: Region Head overview
  • Figure 4: Distribution of predicted person heights. This histogram shows the model’s predicted heights for detected people in the evaluation set. The distribution is centered around the average adult height (approximately 1.7 m), indicating that most detections have realistic scales. A smaller secondary peak at lower heights suggests instances of seated individuals or partial detections, demonstrating that GATE3D’s outputs remain physically plausible.
  • Figure 5: Qualitative 3D detection results on unseen datasets
  • ...and 1 more figure
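
The Figure 2 caption describes a 2D–3D consistency loss that aligns projected 3D box dimensions with frozen 2D predictions. The sketch below illustrates one way such a term could be computed. It is not the authors' formulation, which this section does not give. It assumes a pinhole camera with intrinsics K, a KITTI-style box parameterization (center, height/width/length, yaw about the camera y axis), and an L1 penalty on the width and height of the projected box. All function names are hypothetical.

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners (8, 3) of a 3D box in camera coordinates.

    center: (x, y, z) box center; dims: (h, w, l); yaw: rotation about the y axis.
    This parameterization is an assumption, not taken from the paper.
    """
    h, w, l = dims
    # Corner offsets in the object frame: x along length, y along height, z along width.
    x = np.array([l, l, -l, -l, l, l, -l, -l]) / 2.0
    y = np.array([h, h, h, h, -h, -h, -h, -h]) / 2.0
    z = np.array([w, -w, -w, w, w, -w, -w, w]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (rot_y @ np.vstack([x, y, z])).T + np.asarray(center, dtype=float)

def projected_extent(corners, K):
    """Project corners with intrinsics K; return (width, height) of the enclosing 2D box in pixels."""
    uvw = (K @ corners.T).T          # homogeneous pixel coordinates, shape (8, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]    # perspective divide
    return uv.max(axis=0) - uv.min(axis=0)

def consistency_loss(center, dims, yaw, K, box2d):
    """Mean L1 gap between the projected 3D box size and the 2D box size.

    box2d: (x1, y1, x2, y2) from a 2D detector, treated as a fixed target.
    """
    x1, y1, x2, y2 = box2d
    target = np.array([x2 - x1, y2 - y1], dtype=float)
    pred = projected_extent(box3d_corners(center, dims, yaw), K)
    return float(np.abs(pred - target).mean())
```

In a training setup, the 2D box would act as a constant target so that gradients reach only the 3D branch, matching the caption's note that the 2D detector stays frozen during weak supervision.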