TiG-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning

Peixiang Huang; Li Liu; Renrui Zhang; Song Zhang; Xinli Xu; Baichao Wang; Guoyi Liu

TL;DR

The paper introduces self-assessment knowledge distillation, where a student uses cross-attention with a teacher to identify and minimize discrepancies between feature spaces. It further adds sequence-level and anchor-point distillation to enable efficient, robust transfer for both classification and semantic segmentation. Empirically, the approach yields consistent gains on CIFAR-100, ImageNet, Pascal VOC, and Cityscapes, supported by theoretical analysis suggesting the cross-attention should converge to an identity mapping under reasonable conditions. The work offers a practical, scalable KD framework that extends beyond image classification to segmentation tasks, with clear ablations detailing the contributions of its two distillation modules.

Abstract

To achieve accurate and low-cost 3D object detection, existing methods enhance camera-based multi-view detectors with spatial cues from the LiDAR modality, e.g., dense depth supervision and bird's-eye-view (BEV) feature distillation. However, they directly conduct point-to-point mimicking from LiDAR to camera, which neglects the inner geometry of foreground targets and suffers from the modal gap between 2D and 3D features. In this paper, we propose a scheme, termed TiG-BEV, for learning Target Inner-Geometry from the LiDAR modality in camera-based BEV detectors, covering both dense depth and BEV features. First, we introduce an inner-depth supervision module to learn the low-level relative depth relations between different foreground pixels. This enables the camera-based detector to better understand object-wise spatial structures. Second, we design an inner-feature BEV distillation module to imitate the high-level semantics of different keypoints within foreground targets. To further alleviate the BEV feature gap between the two modalities, we adopt both inter-channel and inter-keypoint distillation for feature-similarity modeling. With our target inner-geometry distillation, TiG-BEV boosts BEVDepth by +2.3% NDS and +2.4% mAP, and BEVDet by +9.1% NDS and +10.3% mAP, on the nuScenes val set. Code will be available at https://github.com/ADLab3Ds/TiG-BEV.
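The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not give the exact loss forms, so the choice of a single reference pixel, the squared-error penalties, the cosine similarity, and all function names below are assumptions made only for illustration.

```python
import math

def relative_depths(depths, ref_index):
    """Depth of each foreground pixel relative to one reference pixel (assumed scheme)."""
    ref = depths[ref_index]
    return [d - ref for d in depths]

def inner_depth_loss(pred_depths, gt_depths, ref_index=0):
    """Mean squared error between predicted and LiDAR relative depths (assumed loss)."""
    pred_rel = relative_depths(pred_depths, ref_index)
    gt_rel = relative_depths(gt_depths, ref_index)
    return sum((p - g) ** 2 for p, g in zip(pred_rel, gt_rel)) / len(gt_rel)

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the row vectors of a 2D list."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (ni * nj) for rj, nj in zip(rows, norms)]
        for ri, ni in zip(rows, norms)
    ]

def similarity_loss(student_rows, teacher_rows):
    """Mean squared difference between student and teacher similarity matrices."""
    s = cosine_similarity_matrix(student_rows)
    t = cosine_similarity_matrix(teacher_rows)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def inter_keypoint_loss(student_kp, teacher_kp):
    """Rows are keypoints: compares keypoint-to-keypoint similarity (N x N)."""
    return similarity_loss(student_kp, teacher_kp)

def inter_channel_loss(student_kp, teacher_kp):
    """Transpose so rows are channels: compares channel-to-channel similarity (C x C)."""
    transpose = lambda rows: [list(col) for col in zip(*rows)]
    return similarity_loss(transpose(student_kp), transpose(teacher_kp))
```

In this sketch a constant depth offset costs nothing, because only relations between pixels of the same target are compared. Likewise, the similarity losses match relations among keypoints or channels rather than raw feature values, which is one way to sidestep a direct point-to-point comparison across modalities.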

Paper Structure (15 sections, 14 equations, 2 figures, 11 tables)

Introduction
Related Works
Method
Formulation
Empirical & Theoretical Analysis
Sequence-level Distillation
Anchor-point Distillation
Experiment
Datasets
Implementation Details
Results on CIFAR-100
Results on ImageNet
Results on Semantic Segmentation
Ablation & Sensitivity Study
Conclusion

Figures (2)

Figure 1: (a) shows the statistics of the covariance and correlation coefficient between feature points of the teacher feature across spatial dimensions. (b) is an example in which the self-attention of the teacher feature approaches the identity matrix. We further demonstrate theoretically that the self-attention of the teacher feature tends to an identity matrix. Based on this finding, this work designs a novel self-assessment loss for knowledge distillation, which requires the cross-attention between student and teacher to be an identity matrix.
Figure 2: Illustration of our framework. (a) Self-assessment Distillation. Given a pair of student and teacher features, the cross-attention map $\rm{attn}$ is first computed and then applied to the student feature to generate the output $\rm{out}$, which is trained to minimize the L$_2$ loss against the corresponding teacher feature. (b) Anchor-point Distillation. Each color indicates a region. We use average pooling to extract the anchor within a local area of the given feature map, forming a new, smaller feature. The generated anchor-point features then participate in the self-assessment distillation. (c) Sequence-level Distillation. Both teacher and student features are sliced and rearranged as sequences. We use multi-head attention (MHA) to calculate the cross-attention for the subsequent self-assessment distillation.
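
The pipeline described in panels (a) and (b) of Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' code. The caption does not specify how the attention map is formed, so the scaled dot-product form with the teacher as query and the student as key, the single-channel feature map, and all function names here are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two 2D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attention(teacher, student):
    """attn[i][j]: weight of student token j for teacher token i (assumed form)."""
    scale = math.sqrt(len(teacher[0]))
    student_t = [list(col) for col in zip(*student)]
    scores = matmul(teacher, student_t)  # teacher @ student^T
    return [softmax([s / scale for s in row]) for row in scores]

def self_assessment_loss(teacher, student):
    """L2 (mean squared) gap between attn @ student and the teacher feature."""
    attn = cross_attention(teacher, student)
    out = matmul(attn, student)
    n, d = len(teacher), len(teacher[0])
    return sum((out[i][k] - teacher[i][k]) ** 2 for i in range(n) for k in range(d)) / (n * d)

def anchor_points(feature_map, window):
    """Non-overlapping average pooling; height and width must be divisible by window."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            sum(feature_map[r + i][c + j] for i in range(window) for j in range(window))
            / window ** 2
            for c in range(0, w, window)
        ]
        for r in range(0, h, window)
    ]
```

When the student already matches a teacher whose tokens are well separated, the attention map in this sketch is close to the identity and the loss is close to zero, which mirrors the observation in Figure 1. Pooling to anchor points first shrinks the number of tokens, so the attention map is correspondingly smaller.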
