TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation

Shaoqing Xu, Fang Li, Peixiang Huang, Ziying Song, Zhi-Xin Yang

TL;DR

TiGDistill-BEV tackles the gap between LiDAR- and camera-based multi-view BEV detectors by transferring rich cross-modal knowledge to a camera-only student. Its Target Inner-Geometry Learning Distillation combines inner-depth supervision for object-internal depth structure with inner-feature BEV distillation for high-level foreground semantics, augmented by inter-channel and inter-keypoint relations to bridge the modality gap. On nuScenes, the approach yields state-of-the-art camera-based performance, with 62.8% NDS and 53.9% mAP on the test set, along with gains on the validation set and in depth metrics. The method demonstrates that exploiting foreground inner-geometry and cross-modal semantics can substantially enhance camera-only BEV 3D detection while reducing the reliance on dense LiDAR supervision.

Abstract

Accurate multi-view 3D object detection is essential for applications such as autonomous driving. Researchers have consistently aimed to leverage LiDAR's precise spatial information to enhance camera-based detectors through methods like depth supervision and bird's-eye-view (BEV) feature distillation. However, existing approaches often face challenges due to the inherent differences between LiDAR and camera data representations. In this paper, we introduce TiGDistill-BEV, a novel approach that effectively bridges this gap by leveraging the strengths of both sensors. Our method distills knowledge from teacher models of diverse modalities (e.g., LiDAR) into a camera-based student detector, using the Target Inner-Geometry learning scheme to enhance camera-based BEV detectors through both depth and BEV features. Specifically, we propose two key modules: an inner-depth supervision module that learns the low-level relative depth relations within objects, equipping detectors with a deeper understanding of object-level spatial structures, and an inner-feature BEV distillation module that transfers the high-level semantics of different keypoints within foreground targets. To further alleviate the domain gap, we incorporate both inter-channel and inter-keypoint distillation to model feature similarity. Extensive experiments on the nuScenes benchmark demonstrate that TiGDistill-BEV significantly boosts camera-only detectors, achieving state-of-the-art performance with 62.8% NDS and surpassing previous methods by a significant margin. The code is available at: https://github.com/Public-BOTs/TiGDistill-BEV.git.
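
The inter-channel and inter-keypoint distillation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each foreground target is represented by K keypoint features of C channels, that "feature similarity" means pairwise dot products, and that student and teacher similarity matrices are compared with a mean squared difference. The function names, the loss form, and the equal channel counts are assumptions for illustration, not the authors' implementation. Plain Python is used so the example runs without dependencies.

```python
def gram(rows):
    """Matrix of pairwise dot products between the given rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def mean_sq_diff(a, b):
    """Mean squared element-wise difference of two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def inner_feature_loss(student, teacher):
    """Compare feature relations (not raw values) of one foreground target.

    student, teacher: K keypoints x C channels (hypothetical layout).
    Returns (inter_channel_loss, inter_keypoint_loss).
    """
    # Inter-keypoint: K x K similarity between keypoints of the same target.
    inter_keypoint = mean_sq_diff(gram(student), gram(teacher))
    # Inter-channel: C x C similarity between channels over the keypoints.
    inter_channel = mean_sq_diff(gram(transpose(student)),
                                 gram(transpose(teacher)))
    return inter_channel, inter_keypoint


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K=3 keypoints, C=2 channels
student = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(inner_feature_loss(student, teacher))
```

Because only relations between features are matched, a student whose features differ from the teacher's in absolute value but share the same pairwise similarities incurs no penalty, which is one way to read the claim that relation-based distillation eases the cross-modal gap.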

Paper Structure

This paper contains 43 sections, 10 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Different LiDAR-to-Camera Learning Schemes: (a) Dense Depth Supervision, which directly supervises the categorical depth distribution of every valid pixel in the whole depth map, (b) BEV Feature Distillation, which strictly aligns BEV feature values across modalities, (c) Our Target Inner-Geometry Learning, which utilizes both the low-level inner-depth relations and the high-level inner-feature semantics of foreground targets.
  • Figure 2: Inner-depth Supervision. We guide the camera-based detector to learn the relative spatial structures within the target foreground areas. A depth reference point (dotted in yellow) is adaptively selected to calculate relative depth.
  • Figure 3: Inner-feature BEV Distillation. We conduct inter-channel and inter-keypoint feature distillation in BEV space for the camera-based detector, which alleviates the cross-modal semantic gap and boosts inner-geometry learning.
  • Figure 4: Overall Framework of TiGDistill-BEV, which contains a pre-trained teacher model, a camera-based detector as the student, and a target inner-geometry scheme for cross-modal learning. Our proposed learning paradigm bridges the modality gap by transferring inner-geometry semantics from the teacher modality via two components: inner-depth supervision for foreground relative depth, and inner-feature BEV distillation at both the channel-wise and keypoint-wise levels.
  • Figure 5: Comparison of Categorical Absolute Depth and Continuous Inner Depth. Inner-depth supervision uses continuous depth values to guide the camera-based student to learn the local spatial structures of foreground targets.
  • ...and 3 more figures
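
The inner-depth supervision described in the captions of Figures 2 and 5 can be sketched as follows. The sketch assumes that a continuous depth is obtained as the expectation of a categorical depth distribution, and that the reference point is the pixel whose predicted depth is closest to the ground truth. The captions only state that the reference point is "adaptively selected", so this selection rule, the squared-error loss, and all function names are assumptions for illustration.

```python
def expected_depth(probs, bin_centers):
    """Collapse a categorical depth distribution into one continuous depth."""
    return sum(p * c for p, c in zip(probs, bin_centers))


def inner_depth_loss(pred_depths, gt_depths):
    """Mean squared error between relative depths inside one foreground target.

    pred_depths, gt_depths: continuous depths of the same foreground pixels.
    """
    # Assumed rule: the reference is the pixel with the smallest absolute
    # prediction error (the first such pixel if several tie).
    ref = min(range(len(gt_depths)),
              key=lambda i: abs(pred_depths[i] - gt_depths[i]))
    # Depth of every pixel relative to the reference point.
    pred_rel = [d - pred_depths[ref] for d in pred_depths]
    gt_rel = [d - gt_depths[ref] for d in gt_depths]
    return sum((p - g) ** 2 for p, g in zip(pred_rel, gt_rel)) / len(gt_depths)


gt = [10.0, 10.5, 11.0]
shifted = [12.0, 12.5, 13.0]   # right shape, wrong absolute distance
print(inner_depth_loss(shifted, gt))
```

In this example the prediction is off by a constant 2 m yet reproduces the object's internal depth structure, so the relative-depth loss is zero. This is the sense in which inner-depth supervision targets local spatial structure rather than absolute depth, which would still need separate supervision.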