HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation

Sha Zhang, Jiajun Deng, Lei Bai, Houqiang Li, Wanli Ouyang, Yanyong Zhang

TL;DR

HVDistill is a hybrid-view knowledge distillation framework that uses a pre-trained image network to guide the feature learning of a point cloud neural network in an unsupervised manner. It achieves consistent improvements over the baseline trained from scratch and significantly outperforms existing schemes.

Abstract

We present a hybrid-view-based knowledge distillation framework, termed HVDistill, that guides the feature learning of a point cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, correspondences between the two modalities can be established in both the image-plane view and the bird-eye view, which facilitates representation learning. Specifically, the image-plane correspondences are obtained simply by projecting the point clouds, while the bird-eye-view correspondences are obtained by lifting pixels to 3D space using depths predicted under the supervision of the projected point clouds. The image teacher networks provide rich semantics from the image-plane view and, at the same time, acquire geometric information from the bird-eye view. Image features from the two views naturally complement each other and together improve the feature representation learned by the point cloud student networks. Moreover, with a self-supervised pre-trained 2D network, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on the nuScenes dataset and transfer it to several downstream tasks on the nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over the baseline trained from scratch and significantly outperforms existing schemes. Code is available at git@github.com:zhangsha1024/HVDistill.git.
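
The image-plane correspondences mentioned in the abstract come from projecting LiDAR points into the camera image. The following is a minimal sketch of such a projection, assuming a pinhole camera model. The function name `project_points_to_image`, the 4x4 extrinsic matrix `lidar_to_cam`, and the 3x3 intrinsic matrix `K` are illustrative assumptions and are not taken from the HVDistill code.

```python
import numpy as np


def project_points_to_image(points, lidar_to_cam, K, image_hw):
    """Project LiDAR points of shape (N, 3) into pixel coordinates.

    Returns (pixels, depths, keep): integer (u, v) pixel coordinates and
    camera-frame depths for the points that land inside the image, plus a
    boolean mask of shape (N,) marking which input points were kept.
    All argument names are hypothetical, chosen for this sketch.
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])      # (N, 4) homogeneous
    cam = (lidar_to_cam @ homo.T).T[:, :3]           # (N, 3) camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6                          # drop points behind camera
    safe_depth = np.where(in_front, depth, 1.0)      # avoid division by zero
    uvw = (K @ cam.T).T                              # apply intrinsics
    uv = uvw[:, :2] / safe_depth[:, None]            # perspective divide
    u = np.floor(uv[:, 0]).astype(int)
    v = np.floor(uv[:, 1]).astype(int)
    h, w = image_hw
    keep = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pixels = np.stack([u[keep], v[keep]], axis=1)
    return pixels, depth[keep], keep


# Toy example: identity extrinsics and a 100x100 image.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0],     # straight ahead, lands at image centre
                [1.0, 0.0, 10.0],     # slightly to the right
                [0.0, 0.0, -5.0],     # behind the camera, dropped
                [100.0, 0.0, 10.0]])  # far outside the field of view, dropped
pixels, depths, keep = project_points_to_image(pts, np.eye(4), K, (100, 100))
```

Each kept point is paired with one pixel, which gives a point-to-pixel correspondence. The returned depths are sparse per-pixel values of the kind that could serve as targets when supervising a depth prediction, although the exact supervision used by HVDistill is not specified in this summary.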

Paper Structure

This paper contains 16 sections, 8 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: Comparison between existing schemes and ours. Both (a) and (b) transfer image knowledge to point cloud networks based on the image-plane view. In contrast, we develop a hybrid-view framework (c) to transfer image knowledge based on both the image-plane view and the bird-eye view.
  • Figure 2: An overview of the proposed HVDistill pipeline. Our approach transfers image knowledge from a pre-trained 2D network into a 3D neural network via hybrid-view contrastive distillation. On the one hand, the point clouds are grouped into superpoints according to the corresponding superpixels generated on each image, and are then supervised by the image features from the 2D teacher network through image-plane-view (IPV) contrastive distillation. On the other hand, the image and point cloud features from the 2D and 3D backbones are transformed to the bird-eye view (BEV), and the image BEV features then supervise the point cloud BEV features through a contrastive loss. Note that the 2D backbone's parameters are frozen.
  • Figure 3: The raw image and superpixels are shown on the left, the point clouds in BEV and the superpoints in the middle, and the zoomed-out image focusing on the superpoints surrounding the selected car on the right. Points in the red boxes represent the same area. Although the points in purple are clustered into a superpoint representing the left car in the image, this superpoint contains not only the car (points in the black area) but also part of the ground (points in the green area), which introduces ambiguity for those points.
  • Figure 4: Performance of different training data for semantic segmentation by fine-tuning on SemanticKITTI. We use 1% training data for fine-tuning.
  • Figure 5: Visualization of the predicted depth of different objects. With sparse point cloud supervision, image features can predict the dense depth and preserve geometric information. Each row represents a specific scene.
  • ...and 4 more figures
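
The Figure 2 caption describes contrastive distillation between paired 2D and 3D features (superpixel/superpoint pairs in the image-plane view, and matching cells in the BEV). Below is a minimal sketch of an InfoNCE-style contrastive loss over such pairs, assuming row `i` of the student matrix is matched with row `i` of the teacher matrix. The function name `info_nce_loss` and the temperature value are assumptions made for illustration; the loss actually used in the paper may differ in its details.

```python
import numpy as np


def info_nce_loss(student, teacher, temperature=0.07):
    """InfoNCE-style loss between paired feature matrices of shape (M, D).

    Row i of `student` and row i of `teacher` form a positive pair; every
    other teacher row acts as a negative for that student row.
    """
    # L2-normalise so the dot product is a cosine similarity.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                    # (M, M) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching (diagonal) entry as the target.
    return float(-np.mean(np.diag(log_prob)))


# Toy check: aligned pairs give a much lower loss than mismatched pairs.
teacher_feats = np.eye(4)                        # stands in for frozen 2D features
aligned = info_nce_loss(np.eye(4), teacher_feats)
mismatched = info_nce_loss(np.roll(np.eye(4), 1, axis=0), teacher_feats)
```

In a training setup of this kind, only the student (3D) features would receive gradients, since the caption states that the 2D backbone's parameters are frozen.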