
OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation

Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, Jiaya Jia

TL;DR

This paper proposes Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost.

Abstract

The boom in 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overtook sparse CNNs and became the state-of-the-art models, especially in 3D semantic segmentation. However, sparse CNNs remain valuable networks because of their efficiency and ease of application. In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We find that the key factor behind the performance difference is adaptivity. Specifically, we propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap. This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules, OA-CNNs surpass point transformers in accuracy in both indoor and outdoor scenes, with much lower latency and memory cost. Notably, OA-CNNs achieve 76.1%, 78.9%, and 70.6% mIoU on the ScanNet v2, nuScenes, and SemanticKITTI validation benchmarks respectively, while running up to 5x faster than their transformer counterparts. This finding highlights the potential of pure sparse CNNs to outperform transformer-related networks.
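The abstract reports accuracy as mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the toy labels are illustrative assumptions and are not taken from the paper. Benchmark toolkits typically accumulate a confusion matrix over the whole validation set, whereas this sketch scores a single short label list.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target.

    Illustrative helper (not from the paper): pred and target are
    equal-length sequences of integer class labels, one per point.
    """
    ious = []
    for c in range(num_classes):
        # Points where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both sequences
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")  # per-class IoUs are 1/3, 2/3 and 1/2
```

Because every class contributes equally to the average, a rare class that is segmented poorly lowers mIoU as much as a common one.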

Paper Structure

This paper contains 49 sections, 8 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: Visualization of 3D scene receptive fields controlled by our proposed adaptive aggregator. Objects' edges and junctions require smaller receptive fields due to their sophisticated structures, while flat planes and unitary structures require broader fields.
  • Figure 2: Comparison between various transformer-based [lai2022stratifiedtransformer, zhao2021PointTransformer, wu2022PointTransformerV2] and CNN-based [graham2018submanifold, choy2019Minkowskiconvolution] methods on an RTX 3090. For OctFormer, we reproduce results using the official repository and include the cost of building the octree. If a method has multiple versions, they are indicated by different dots.
  • Figure 3: Comparison between 3D point-based [qi2017pointnet, zhao2021PointTransformer] and convolutional networks [graham2018submanifold, choy2019Minkowskiconvolution]. PointNets process the raw points directly and provide more flexible and broader receptive fields. ConvNets operate on structured data after an additional voxelization preprocessing step, with higher efficiency and lower resource consumption.
  • Figure 4: Illustration of the adaptive aggregator, which learns to aggregate various grid contexts under multi-pyramid scales from the voxel's intrinsic characteristics.
  • Figure 5: Illustration of the Adaptive Relation Convolution (ARConv). It dynamically generates grid convolution's kernel weights only for the non-empty voxels with their relationships to the centroid voxel.
  • ...and 9 more figures
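
The Figure 4 caption describes an aggregator that weights context from several pyramid scales based on each voxel's own features. The sketch below shows one way such per-voxel gating could look in plain Python. It is an assumption made for illustration, not the authors' implementation: the names `adaptive_aggregate` and `gate_weights`, the linear gate, and the softmax normalization are all hypothetical choices that the text here does not confirm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_aggregate(voxel_feat, scale_contexts, gate_weights):
    """Fuse multi-scale context for one voxel (hypothetical sketch).

    voxel_feat:     C floats, the voxel's own feature vector.
    scale_contexts: K lists of C floats, one context vector per scale.
    gate_weights:   K lists of C floats, an assumed linear gate that maps
                    the voxel feature to one logit per scale.
    Returns (fused_feature, per_scale_weights).
    """
    # One logit per scale, predicted from the voxel's own feature.
    logits = [sum(w * f for w, f in zip(row, voxel_feat)) for row in gate_weights]
    weights = softmax(logits)
    channels = len(scale_contexts[0])
    # Weighted sum of the scale contexts, channel by channel.
    fused = [
        sum(weights[k] * scale_contexts[k][c] for k in range(len(scale_contexts)))
        for c in range(channels)
    ]
    return fused, weights


# Toy usage: two scales, two channels.
contexts = [[2.0, 4.0], [4.0, 8.0]]   # small-scale and large-scale context
voxel = [1.0, 0.0]
gate = [[5.0, 0.0], [0.0, 0.0]]       # this gate favors the first scale
fused, weights = adaptive_aggregate(voxel, contexts, gate)
print(weights, fused)
```

With this gate, a voxel whose first channel is active leans toward the small-scale context, which is the behavior Figure 1 attributes to edges and junctions. A voxel with a zero first channel would weight both scales equally.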