TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation
Shaoqing Xu, Fang Li, Peixiang Huang, Ziying Song, Zhi-Xin Yang
TL;DR
TiGDistill-BEV tackles the gap between LiDAR- and camera-based multi-view BEV detectors by transferring rich cross-modal knowledge to a camera-only student. Its Target Inner-Geometry Learning Distillation combines inner-depth supervision for object-internal depth structure with inner-feature BEV distillation for high-level foreground semantics, augmented by inter-channel and inter-keypoint relations to bridge modality gaps. On nuScenes, the approach yields state-of-the-art camera-based performance, reaching 62.8% NDS and 53.9% mAP on the test set, with notable gains on the validation set and in depth metrics. The method demonstrates that exploiting foreground inner-geometry and cross-modal semantics can substantially enhance camera-only BEV 3D detection while mitigating the reliance on dense LiDAR supervision.
Abstract
Accurate multi-view 3D object detection is essential for applications such as autonomous driving. Researchers have consistently aimed to leverage LiDAR's precise spatial information to enhance camera-based detectors through methods such as depth supervision and bird's-eye-view (BEV) feature distillation. However, existing approaches often face challenges due to the inherent differences between LiDAR and camera data representations. In this paper, we introduce TiGDistill-BEV, a novel approach that effectively bridges this gap by leveraging the strengths of both sensors. Our method distills knowledge from teacher models of diverse modalities (e.g., LiDAR) into a camera-based student detector, using the Target Inner-Geometry learning scheme to enhance camera-based BEV detectors through both depth and BEV features. Specifically, we propose two key modules: an inner-depth supervision module that learns the low-level relative depth relations within objects, equipping detectors with a deeper understanding of object-level spatial structures, and an inner-feature BEV distillation module that transfers high-level semantics of different keypoints within foreground targets. To further alleviate the domain gap, we incorporate both inter-channel and inter-keypoint distillation to model feature similarity. Extensive experiments on the nuScenes benchmark demonstrate that TiGDistill-BEV significantly boosts camera-only detectors, achieving state-of-the-art performance with 62.8% NDS and surpassing previous methods by a significant margin. The code is available at: https://github.com/Public-BOTs/TiGDistill-BEV.git.
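
To make the two modules more concrete, the following is a minimal sketch of the ideas named in the abstract, not the released implementation. It assumes that inner-depth supervision compares depths relative to a reference pixel inside each object, and that inter-keypoint and inter-channel distillation match cosine-similarity matrices between student and teacher features sampled at keypoints inside a foreground target. The function names, the fixed reference index, the cosine similarity, and the mean-squared-error objective are all illustrative assumptions.

```python
import math


def inner_depth_loss(pred_depths, gt_depths, ref_index=0):
    """Relative-depth loss over the pixels of ONE object (assumed form).

    Depths are expressed relative to a reference pixel, so only the
    object's internal depth structure is penalized.
    """
    pred_rel = [d - pred_depths[ref_index] for d in pred_depths]
    gt_rel = [d - gt_depths[ref_index] for d in gt_depths]
    return sum((p - g) ** 2 for p, g in zip(pred_rel, gt_rel)) / len(pred_rel)


def _l2_normalize(rows):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def _similarity(rows):
    """Cosine-similarity matrix between rows: (n, d) -> (n, n)."""
    unit = _l2_normalize(rows)
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in unit] for r1 in unit]


def _mse(a, b):
    """Mean squared difference between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def inner_feature_loss(student_kp, teacher_kp):
    """Inter-keypoint plus inter-channel similarity matching (assumed form).

    Both inputs are K keypoints x C channels, sampled from the BEV
    features inside one foreground target.
    """
    # Inter-keypoint: K x K relations between keypoints of the same target.
    inter_keypoint = _mse(_similarity(student_kp), _similarity(teacher_kp))
    # Inter-channel: C x C relations between feature channels.
    student_ch = [list(col) for col in zip(*student_kp)]
    teacher_ch = [list(col) for col in zip(*teacher_kp)]
    inter_channel = _mse(_similarity(student_ch), _similarity(teacher_ch))
    return inter_keypoint + inter_channel


# Toy usage: 3 keypoints with 2 channels each, and 3 depth samples.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
feature_loss = inner_feature_loss(student, teacher)
depth_loss = inner_depth_loss([15.0, 15.5, 16.0], [10.0, 10.5, 11.0])
```

Matching similarity matrices instead of raw feature values is one way to sidestep the scale and distribution mismatch between LiDAR and camera features: in this sketch the feature loss is unchanged when the teacher features are uniformly rescaled, and the depth loss ignores a constant depth offset across the whole object.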
