GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds

Shengjun Zhang, Xin Fei, Yueqi Duan

TL;DR

GeoAuxNet tackles the domain gaps between RGB-D and LiDAR point clouds by enabling universal 3D representation learning through geometry-to-voxel auxiliary learning. It introduces voxel-guided dynamic point networks, hierarchical geometry pools, and a geometry-to-voxel fusion mechanism to inject point-level geometry into voxel backbones without increasing inference cost. The method achieves strong improvements on multi-sensor semantic segmentation and remains competitive with single-sensor experts, while also offering practical efficiency advantages. This work advances universal 3D representation learning for heterogeneous point clouds and provides a scalable framework for cross-sensor understanding.

Abstract

Point clouds captured by different sensors, such as RGB-D cameras and LiDAR, exhibit non-negligible domain gaps. Most existing methods design different network architectures and train them separately on point clouds from each sensor. Typically, point-based methods perform well on evenly distributed, dense point clouds from RGB-D cameras, while voxel-based methods are more efficient for large-range, sparse LiDAR point clouds. In this paper, we propose geometry-to-voxel auxiliary learning, which gives voxel representations access to point-level geometric information and supports better generalisation of the voxel-based backbone by supplying additional interpretations of multi-sensor point clouds. Specifically, we construct hierarchical geometry pools generated by a voxel-guided dynamic point network, which efficiently provide auxiliary fine-grained geometric information adapted to different stages of voxel features. We conduct experiments on joint multi-sensor datasets to demonstrate the effectiveness of GeoAuxNet. Benefiting from this detailed geometric information, our method outperforms other models trained collectively on multi-sensor datasets and achieves results competitive with state-of-the-art experts on each individual dataset.
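
To make the voxel representation mentioned in the abstract concrete, here is a minimal sketch of plain grid voxelization: each point is assigned to a cubic cell, and every occupied cell is summarised by the centroid of its points. This is a generic illustration, not the paper's implementation. The function name `voxelize`, the `voxel_size` value, and the toy points are all assumptions made for this example, and the sparse convolutional backbone itself is not shown.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size):
    """Group 3D points into cubic voxels; return {voxel index: centroid}.

    Hypothetical helper for illustration only (not from the paper).
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer grid index of the cell that contains the point.
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        buckets[key].append((x, y, z))

    voxels = {}
    for key, members in buckets.items():
        n = len(members)
        # Summarise the cell by the mean position of its points.
        voxels[key] = tuple(sum(p[i] for p in members) / n for i in range(3))
    return voxels


# Toy input: two nearby points share a cell, the other two fall in separate cells.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.0), (-0.2, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=0.5)
for index, centroid in sorted(voxels.items()):
    print(index, centroid)
```

Only occupied cells are stored, which is why this kind of representation stays compact for large, sparse LiDAR scenes. The trade-off is that points inside one cell are merged, so fine local geometry is lost, and that is the information the geometry-to-voxel auxiliary learning aims to supply.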

Paper Structure (24 sections, 9 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Semantic segmentation results on S3DIS [S3DIS2016CVPR] and ScanNet [ScanNet2017CVPR] from RGB-D cameras and on SemanticKITTI [SemanticKITTI2019ICCV] from LiDAR. All methods are trained collectively on the three datasets. Our method outperforms the other methods and recovers finer structural details.
  • Figure 2: The pipeline of GeoAuxNet. For a complete scene point cloud $\mathcal{P}^{\mathcal{C}}$ and a local point patch $\mathcal{P}\subseteq\mathcal{P}^{\mathcal{C}}$, the voxel-based backbone first voxelizes $\mathcal{P}^{\mathcal{C}}$ and applies sparse convolutions. The voxel-guided hypernetwork takes relative positions, voxel features and a stage latent code as input and produces the weights and biases of the point network. The point network then encodes the spatial information of $\mathcal{P}$ and aggregates local features to generate geometric feature candidates. Following the update strategy, we construct hierarchical geometry pools. The geometry-to-voxel mechanism fuses the geometric features stored in the pools so that voxel representations can access point-level geometric information. This process is repeated several times to extract effective representations hierarchically, and the results are predicted with a voxel decoder for the primary task and a point decoder for the auxiliary task. The dotted line marks the auxiliary learning path, which is skipped during inference to ensure efficiency. Geo-to-Vox is an abbreviation of Geometry-to-Voxel.
  • Figure 3: Visualization of the cosine similarity between features. The purple star is the selected point, and the green star is a point whose feature is highly similar to that of the purple star. We compute the cosine similarity between the feature of the purple star and all other features and visualize it in images (b), (d) and (f). The nearest neighbors of the purple and green stars are marked in red in images (a), (c) and (e). Images (a), (b), (c) and (d) are generated from the first stage of the point network, and (e) and (f) are from the second stage.
  • Figure 4: Statistical results of cosine similarity between hierarchical geometry pools and point-level features extracted by the point network from point clouds in S3DIS [S3DIS2016CVPR] at different stages.
  • Figure 5: Cosine similarity between geometry pools. (a) shows the similarity between the geometry pools of S3DIS [S3DIS2016CVPR] and ScanNet [ScanNet2017CVPR]. (b) shows the similarity between the geometry pools of S3DIS [S3DIS2016CVPR] and SemanticKITTI [SemanticKITTI2019ICCV]. Intra-sensor geometry pools in (a) have higher similarity than inter-sensor geometry pools in (b).
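
Figures 3 to 5 all rely on cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, either between one selected feature and all others (Figure 3) or between the entries of two geometry pools (Figure 5). The sketch below shows both computations on toy 2D vectors. The function names, the `eps` guard and the feature values are assumptions for illustration; the actual feature dimensions and pool sizes are not given in this section.

```python
from math import sqrt


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors (eps guards zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def similarity_to_all(query, features):
    """Figure 3 style: similarity of one selected feature to every feature."""
    return [cosine_similarity(query, f) for f in features]


def pool_similarity(pool_a, pool_b):
    """Figure 5 style: pairwise similarity matrix between two pools."""
    return [[cosine_similarity(a, b) for b in pool_b] for a in pool_a]


# Toy features (hypothetical values): index 0 plays the role of the selected point.
features = [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0), (-1.0, 0.0)]
scores = similarity_to_all(features[0], features)

# Most similar feature other than the query itself.
best = max(range(1, len(features)), key=lambda i: scores[i])

# Pairwise matrix between two small toy "pools".
matrix = pool_similarity(features[:2], features[2:])
```

Because cosine similarity ignores vector magnitude, `features[1]` matches the query perfectly even though it is twice as long, which makes the measure suitable for comparing features produced at different stages or from different sensors.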