3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

Benjamin Graham, Martin Engelcke, Laurens van der Maaten

TL;DR

The paper addresses the computational cost of applying convolutional networks to sparse 3D data by introducing submanifold sparse convolutions (SSC) and using them to build submanifold sparse convolutional networks (SSCNs). An SSC keeps the set of active sites unchanged from layer to layer, so the sparsity pattern does not dilate and deep architectures for 3D semantic segmentation stay efficient. In experiments on ShapeNet and NYU Depth v2, SSCNs outperform strong baselines in both accuracy (IoU) and efficiency (FLOPs), and they outperform prior state-of-the-art results on the test set of a recent semantic segmentation competition. The implementation, built on a hash table and a rule book, supports scalable training and inference, which matters for practical 3D scene understanding.
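To make the hash-table and rule-book idea concrete, here is a minimal single-channel sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_rule_book` and `apply_ssc`, the 2D example, and the one-scalar-weight-per-offset simplification are all hypothetical. The hash table maps each active coordinate to a row of the feature list, and the rule book lists, for every filter offset, which input rows contribute to which output rows.

```python
from itertools import product


def build_rule_book(active_sites, filter_size=3):
    """Build a hash table and rule book for a submanifold-style convolution.

    hash_table: active coordinate -> row index in the feature list.
    rule_book:  filter offset -> list of (input_row, output_row) pairs.
    Outputs are only computed at sites that are already active.
    """
    hash_table = {site: row for row, site in enumerate(active_sites)}
    half = filter_size // 2
    dim = len(active_sites[0])
    rule_book = {}
    for offset in product(range(-half, half + 1), repeat=dim):
        pairs = []
        for out_site, out_row in hash_table.items():
            in_site = tuple(c + o for c, o in zip(out_site, offset))
            in_row = hash_table.get(in_site)  # None if the neighbor is inactive
            if in_row is not None:
                pairs.append((in_row, out_row))
        if pairs:
            rule_book[offset] = pairs
    return hash_table, rule_book


def apply_ssc(features, rule_book, weights):
    """Apply the convolution using one scalar weight per offset (assumption)."""
    out = [0.0] * len(features)
    for offset, pairs in rule_book.items():
        w = weights[offset]
        for in_row, out_row in pairs:
            out[out_row] += w * features[in_row]
    return out


# Hypothetical 2D input: three sites on a diagonal plus one isolated site.
active_sites = [(0, 0), (1, 1), (2, 2), (5, 5)]
features = [1.0, 2.0, 3.0, 4.0]
weights = {offset: 1.0 for offset in product((-1, 0, 1), repeat=2)}

hash_table, rule_book = build_rule_book(active_sites)
output = apply_ssc(features, rule_book, weights)
print(output)  # [3.0, 6.0, 5.0, 4.0]
```

The output has exactly one entry per active input site, so the sparsity pattern is unchanged. Inactive neighbors never appear in the rule book and therefore cost no arithmetic.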

Abstract

Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.

Paper Structure

This paper contains 23 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of 3D point clouds of objects from the ShapeNet part-segmentation challenge (Yi et al., 2017). The colors of the points represent the part labels.
  • Figure 2: Example of "submanifold" dilation. Left: Original curve. Middle: Result of applying a regular $3 \times 3$ convolution with weights $1/9$. Right: Result of applying the same convolution again. The example shows that regular convolutions substantially reduce the sparsity of the features with each convolutional layer.
  • Figure 3: SSC$(\cdot,\cdot,3)$ receptive field centered at different active spatial locations. Active locations in the field are shown in green. Red locations are ignored by SSC so the pattern of active locations remains unchanged.
  • Figure 4: Illustrations of our submanifold sparse FCN (a) and U-Net (b) architectures. Dark blue boxes represent one or more "pre-activated" SSC$(\cdot,\cdot,3)$ convolutions, which may have residual connections. Red boxes represent size-2, stride-2 downsampling convolutions; green deconvolutions "invert" these convolutions. Purple upsampling boxes perform "nearest-neighbor" upsampling. The final linear and softmax layers are applied separately on each active input voxel.
  • Figure 5: Average intersection-over-union (IoU) on the test set of SSCNs trained for 3D semantic segmentation on the ShapeNet competition data set (higher is better).
  • ...and 1 more figure
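
The dilation effect described in the Figure 2 caption can be reproduced with a small sketch. It assumes a hypothetical diagonal "curve" of 10 active sites on a 2D grid, and it models a regular $3 \times 3$ convolution with positive weights as making every site active that has at least one active site in its receptive field. The helper names `regular_conv_active_set` and `ssc_active_set` are illustrative and do not come from the paper.

```python
def regular_conv_active_set(active, filter_size=3):
    """Active set after a regular convolution with positive weights.

    Any site whose receptive field contains an active input becomes active,
    so the active set grows by the filter's half-width in every direction.
    """
    half = filter_size // 2
    steps = range(-half, half + 1)
    return {(x + dx, y + dy) for (x, y) in active for dx in steps for dy in steps}


def ssc_active_set(active):
    """Active set after an SSC: outputs exist only at already-active sites."""
    return set(active)


# Hypothetical curve: 10 active sites along a diagonal.
curve = {(i, i) for i in range(10)}

dense_counts = [len(curve)]
ssc_counts = [len(curve)]
dense, sparse = curve, curve
for _ in range(2):  # two stacked layers, as in Figure 2
    dense = regular_conv_active_set(dense)
    sparse = ssc_active_set(sparse)
    dense_counts.append(len(dense))
    ssc_counts.append(len(sparse))

print("regular convolution:", dense_counts)
print("submanifold sparse convolution:", ssc_counts)
```

With this input, the number of active sites under regular convolution grows with each layer, while the SSC count stays at 10. That growth is the extra work a deep network of regular convolutions would accumulate on sparse data.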