RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection

Rui Ding, Zhaonian Kuang, Zongwei Zhou, Meng Yang, Xinhu Zheng, Gang Hua

TL;DR

This work proposes RayD3D, which transfers crucial depth knowledge along the ray (the line projecting from the camera to the true location of an object). It significantly improves the robustness of all three base models in all scenarios without increasing inference cost.

Abstract

Multi-view 3D detection with bird's eye view (BEV) is crucial for autonomous driving and robotics, but its robustness in the real world is limited because it struggles to predict accurate depth values. A mainstream solution, cross-modal distillation, transfers depth information from LiDAR to camera models but also unintentionally transfers depth-irrelevant information (e.g., LiDAR density). To mitigate this issue, we propose RayD3D, which transfers crucial depth knowledge along the ray: the line projecting from the camera to the true location of an object. It is based on the fundamental imaging principle that the predicted location of an object can only vary along this ray, with its position on the ray determined by the predicted depth value. Therefore, distilling along the ray enables more effective depth information transfer. More specifically, we design two ray-based distillation modules. Ray-based Contrastive Distillation (RCD) incorporates contrastive learning into distillation by sampling along the ray to learn how LiDAR accurately locates objects. Ray-based Weighted Distillation (RWD) adaptively adjusts the distillation weight based on the ray to minimize the interference of depth-irrelevant information in LiDAR. For validation, we apply RayD3D to three representative types of BEV-based models: BEVDet, BEVDepth4D, and BEVFormer. Our method is trained on clean NuScenes and tested on both clean NuScenes and RoboBEV, which covers a variety of data corruptions. It significantly improves the robustness of all three base models in all scenarios without increasing inference cost, and achieves the best results when compared with recently released multi-view and distillation models.
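
The imaging principle the abstract relies on can be illustrated with a standard pinhole camera model. The sketch below is not taken from the paper: the intrinsic matrix `K`, the pixel coordinates, and the function names `backproject` and `sample_along_ray` are hypothetical values chosen for illustration. It shows that sweeping the predicted depth for a fixed pixel moves the 3D point along a single ray, so its BEV position (lateral x, forward z) keeps a constant x/z ratio.

```python
import numpy as np

def backproject(pixel_uv, depth, K):
    """Return the camera-frame 3D point for a pixel at a given depth (pinhole model)."""
    u, v = pixel_uv
    # Ray direction through the pixel, scaled so its z-component equals 1.
    direction = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * direction

def sample_along_ray(pixel_uv, depths, K):
    """Stack the candidate 3D points obtained by sweeping depth along one ray."""
    return np.stack([backproject(pixel_uv, d, K) for d in depths])

# Hypothetical intrinsics (assumption): focal length 1000 px, principal point (800, 450).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])

# Three candidate depths for the same pixel, e.g. one correct and two wrong predictions.
points = sample_along_ray((1000.0, 450.0), depths=[10.0, 20.0, 40.0], K=K)

# BEV coordinates use the lateral (x) and forward (z) camera axes.
bev_xz = points[:, [0, 2]]
print(bev_xz)
```

All three BEV points lie on one line through the camera origin; only the distance along that line changes with depth, which is why an incorrect depth prediction displaces an object along the ray rather than across it.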

Paper Structure (26 sections, 10 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Ray prior for cross-modal distillation. (a) The line projecting from the camera to the true location of an object forms a ray in both the front view and BEV. The predicted location of the object can only vary along this ray, and its position on the ray is determined by the predicted depth value. (b) LiDAR provides accurate depth, whereas inaccurate depth from the camera places the object at the wrong location along the ray. Therefore, distilling along the ray enables more effective depth transfer. (c) When data corruptions occur, depth accuracy as measured by MATE (an error metric, lower is better) degrades from 0.72 to 1.00; consequently, 3D detection accuracy (NDS) drops from 37.20 to 6.06.
  • Figure 2: Visual examples of our RayD3D. The accuracy of multi-view 3D detection drops significantly in real-world scenarios. Our method remains robust on both clean data and various types of data corruptions, whether or not those corruptions were seen during training.
  • Figure 3: Overview of our RayD3D framework. It includes a teacher network, a student network, and two novel ray-based distillation modules. First, we train the LiDAR-based teacher network and freeze its parameters. The student network is then initialized randomly and trained with our RCD and RWD distillation modules. During inference, only the student network is used, and it is evaluated on both clean and corrupted data. The framework significantly enhances accuracy and robustness in real-world scenarios without increasing inference cost.
  • Figure 4: Distillation weight in RWD. For rays with large differences between camera and LiDAR features, we increase the weight to transfer more depth information for correction. For rays with small differences, we reduce the weight so that depth-irrelevant information does not interfere with the camera model.
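
The weighting idea described for Figure 4 can be sketched as a per-ray weighted feature-mimicking loss. This is a minimal illustration under stated assumptions, not the paper's formulation: the function name `ray_weighted_distill_loss`, the `ray_ids` map that assigns BEV cells to rays, and the choice of weights proportional to the mean squared feature difference per ray are all hypothetical. The source only states that rays with larger camera/LiDAR feature differences receive larger weights.

```python
import numpy as np

def ray_weighted_distill_loss(student, teacher, ray_ids, num_rays):
    """Feature-mimicking loss in which every ray carries its own weight.

    student, teacher: (C, H, W) BEV feature maps.
    ray_ids: (H, W) integer map giving the ray index of each BEV cell, -1 if on no ray.
    """
    # Squared feature difference per BEV cell, averaged over channels.
    diff = ((student - teacher) ** 2).mean(axis=0)
    valid = ray_ids >= 0
    ids = ray_ids[valid]

    # Mean camera/LiDAR feature difference on each ray.
    sums = np.bincount(ids, weights=diff[valid], minlength=num_rays)
    counts = np.bincount(ids, minlength=num_rays)
    ray_diff = sums / np.maximum(counts, 1)

    # Assumed weighting rule: proportional to the ray difference, normalised so that
    # the weights of non-empty rays average to 1. In a training framework these
    # weights would typically be treated as constants (no gradient).
    present = counts > 0
    weights = np.zeros(num_rays)
    weights[present] = ray_diff[present] / (ray_diff[present].mean() + 1e-8)

    # Weighted mean of the per-cell differences over all cells that lie on a ray.
    loss = (weights[ids] * diff[valid]).sum() / max(int(valid.sum()), 1)
    return loss, weights

# Toy example: one channel, a 2x2 BEV grid, two rays (row 0 is ray 0, row 1 is ray 1).
teacher = np.zeros((1, 2, 2))
student = np.array([[[1.0, 1.0], [3.0, 3.0]]])
ray_ids = np.array([[0, 0], [1, 1]])

loss, weights = ray_weighted_distill_loss(student, teacher, ray_ids, num_rays=2)
print(weights, loss)
```

In the toy example, ray 1 has the larger feature gap and therefore receives the larger weight, so its cells dominate the loss, while the well-aligned ray 0 contributes little.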