FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection

Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, Yunhong Wang

TL;DR

The paper addresses the performance gap between BEV-based multi-view 3D detectors and LiDAR-based methods by introducing Foreground Self-Distillation (FSD), which removes the need for a pre-trained teacher model. FSD-BEV combines foreground-guided self-distillation with two Point Cloud Intensification strategies, Frame Combination and Pseudo Point Assignment, that compensate for sparse point clouds, plus a Multi-Scale Foreground Enhancement module that fuses foreground cues across scales. The authors report state-of-the-art results on nuScenes, with ablations showing additive benefits from each component and favorable comparisons to cross-modal distillation. Because the teacher and student branches are trained jointly within a single model, the framework improves robustness to background noise and sparse point clouds, and it suits deployments where a heavy separate teacher model is impractical.

Abstract

Although multi-view 3D object detection based on the Bird's-Eye-View (BEV) paradigm has garnered widespread attention as an economical and deployment-friendly perception solution for autonomous driving, there is still a performance gap compared to LiDAR-based methods. In recent years, several cross-modal distillation methods have been proposed to transfer beneficial information from teacher models to student models, with the aim of enhancing performance. However, discrepancies in feature distribution arising from different data modalities and network structures make knowledge transfer difficult for these methods. In this paper, we propose a Foreground Self-Distillation (FSD) scheme that avoids the issue of distribution discrepancies, maintaining remarkable distillation effects without the need for pre-trained teacher models or cumbersome distillation strategies. Additionally, we design two Point Cloud Intensification (PCI) strategies, frame combination and pseudo point assignment, to compensate for the sparsity of point clouds. Finally, we develop a Multi-Scale Foreground Enhancement (MSFE) module that extracts and fuses multi-scale foreground features using a predicted elliptical Gaussian heatmap, further improving the model's performance. We integrate all the above innovations into a unified framework named FSD-BEV. Extensive experiments on the nuScenes dataset show that FSD-BEV achieves state-of-the-art performance, highlighting its effectiveness. The code and models are available at: https://github.com/CocoBoom/fsd-bev.
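
To make the term "elliptical Gaussian heatmap" concrete, the following is a minimal NumPy sketch of how such a foreground map could be rendered from 2D boxes. It is an illustration only, not the authors' implementation: the function name `elliptical_gaussian_heatmap`, the box format, and the `sigma_scale` hyperparameter are all assumptions introduced here, and the abstract does not specify how the heatmap is parameterized.

```python
import numpy as np

def elliptical_gaussian_heatmap(height, width, boxes, sigma_scale=0.25):
    """Render a foreground heatmap from 2D boxes given as (x1, y1, x2, y2) pixels.

    Hypothetical parameterization: the standard deviation along each axis is
    sigma_scale times the box extent along that axis, so wide boxes produce
    wide ellipses and tall boxes produce tall ones.
    """
    # Pixel coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sx = max(sigma_scale * (x2 - x1), 1e-3)  # avoid division by zero
        sy = max(sigma_scale * (y2 - y1), 1e-3)
        gauss = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                         + (ys - cy) ** 2 / (2 * sy ** 2)))
        # Overlapping objects keep the strongest response per pixel.
        heatmap = np.maximum(heatmap, gauss)
    return heatmap

# Usage: a box that is wider than it is tall yields a horizontally stretched peak.
hm = elliptical_gaussian_heatmap(8, 16, [(2, 2, 12, 6)])
```

Unlike a circular Gaussian with a single radius, the two per-axis deviations let the response follow the aspect ratio of each object.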

Paper Structure

This paper contains 33 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of the cross-modal distillation framework and our self-distillation framework. The hard labels are the depth maps and foreground segmentation generated from LiDAR point clouds; the soft labels are those predicted by the student. The teacher branch and student branch share the same image features, which mitigates the distribution discrepancies between the distillation targets.
  • Figure 2: The overall architecture of the proposed FSD-BEV. The features enhanced by the foreground heatmap are fed into the View Transformation Module to generate the student BEV. The teacher branch generates the teacher BEV by combining hard labels with soft labels from the student branch. The two BEV features are then concatenated along the batch dimension for joint training and undergo distillation before entering the detection head.
  • Figure 3: Details of BEV feature generation. Due to the sparse nature of the depth label, the teacher branch cannot acquire dense foreground information. To address this limitation, we use soft labels generated by the student branch to assist the teacher branch in generating denser depth maps.
  • Figure 4: Overview of Point Cloud Intensification. In (a), we enhance the current frame by merging points of static objects from temporally adjacent frames. In (b), green and red points represent foreground and background points, respectively. We employ the Pseudo Point Assignment (PPA) strategy to complete missing point clouds for foreground objects.
  • Figure A: Comparison of performance between the baseline (BEVDepth) and FSD-BEV during training. FSD-BEV is divided into student and teacher branches, and we evaluate mAP and NDS on the nuScenes val set.
  • ...and 2 more figures
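
The captions for Figures 2 and 3 describe the teacher branch as combining sparse hard labels with soft labels from the student to obtain denser depth maps. The sketch below shows one way such a combination could look in NumPy. It is a hedged illustration rather than the paper's method: the function name `build_teacher_depth`, the use of `-1` to mark pixels without a LiDAR return, and the rule "use the student's prediction only on foreground pixels that lack a LiDAR point" are assumptions made here, since the captions do not give the exact fusion rule.

```python
import numpy as np

def build_teacher_depth(hard_depth_bin, student_depth_prob, foreground_mask):
    """Fuse sparse LiDAR depth labels with the student's predicted depth.

    hard_depth_bin:     (H, W) int array, depth-bin index per pixel,
                        -1 where no LiDAR point projects (assumed convention).
    student_depth_prob: (D, H, W) float array, student's per-pixel depth
                        distribution over D bins (soft label).
    foreground_mask:    (H, W) bool array marking foreground pixels.
    Returns a (D, H, W) teacher depth distribution.
    """
    has_lidar = hard_depth_bin >= 0

    # Hard label: one-hot depth distribution wherever a LiDAR point exists.
    teacher = np.zeros_like(student_depth_prob)
    rows, cols = np.nonzero(has_lidar)
    teacher[hard_depth_bin[rows, cols], rows, cols] = 1.0

    # Soft label (assumed rule): foreground pixels without a LiDAR return
    # borrow the student's predicted distribution.
    use_soft = foreground_mask & ~has_lidar
    teacher[:, use_soft] = student_depth_prob[:, use_soft]
    return teacher

# Usage on a toy 2x2 image with 3 depth bins.
hard = np.array([[1, -1], [-1, -1]])
student = np.full((3, 2, 2), 1.0 / 3.0)
fg = np.array([[True, True], [False, False]])
teacher_depth = build_teacher_depth(hard, student, fg)
```

In this toy case the pixel with a LiDAR return keeps its one-hot label, the uncovered foreground pixel inherits the student's distribution, and background pixels without LiDAR stay empty.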