3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
Benjamin Graham, Martin Engelcke, Laurens van der Maaten
TL;DR
The paper addresses the computational cost of applying convolutional networks to sparse 3D data by introducing submanifold sparse convolutions (SSC) and building submanifold sparse convolutional networks (SSCNs) from them. An SSC produces output only at sites that are already active in its input, so the sparsity pattern is preserved from layer to layer instead of dilating, which keeps deep architectures for 3D semantic segmentation efficient. In experiments on ShapeNet and NYU Depth v2, SSCNs outperform strong baselines in accuracy (IoU) at lower computational cost (FLOPs), and they outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition. The implementation, which combines a hash table with a rule book, underpins scalable training and inference for practical 3D scene understanding.
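The hash-table and rule-book idea mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `build_rulebook` and `submanifold_conv`, the dictionary-based weight layout, and the toy inputs are all assumptions made for illustration. The hash table maps each active coordinate to a row of the feature matrix, and the rule book lists, for every kernel offset, which input rows contribute to which output rows. Because outputs are computed only at the input's active sites, the set of active sites does not grow.

```python
from itertools import product


def build_rulebook(coords, kernel_size=3):
    """Build a hash table and rule book for a submanifold sparse convolution.

    hash_table: coordinate tuple -> row index of that active site.
    rulebook:   kernel offset -> list of (input_row, output_row) pairs.
    (Hypothetical helper, named for this sketch only.)
    """
    hash_table = {c: i for i, c in enumerate(coords)}
    dim = len(coords[0])
    r = kernel_size // 2
    rulebook = {}
    for offset in product(range(-r, r + 1), repeat=dim):
        pairs = []
        # Outputs exist only at active input sites, so loop over those sites.
        for out_row, c in enumerate(coords):
            neighbor = tuple(ci + oi for ci, oi in zip(c, offset))
            in_row = hash_table.get(neighbor)
            if in_row is not None:  # inactive neighbors contribute nothing
                pairs.append((in_row, out_row))
        if pairs:
            rulebook[offset] = pairs
    return hash_table, rulebook


def submanifold_conv(features, rulebook, weights, n_out):
    """Accumulate features[in_row] @ weights[offset] into out[out_row]."""
    out = [[0.0] * n_out for _ in features]
    for offset, pairs in rulebook.items():
        w = weights[offset]  # shape: n_in x n_out, as nested lists
        for in_row, out_row in pairs:
            x = features[in_row]
            for k in range(n_out):
                out[out_row][k] += sum(x[c] * w[c][k] for c in range(len(x)))
    return out


# Toy input: two adjacent voxels and one isolated voxel, one channel each.
coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
features = [[1.0], [2.0], [3.0]]
# All-ones 3x3x3 kernel with 1 input and 1 output channel (an assumption).
weights = {off: [[1.0]] for off in product(range(-1, 2), repeat=3)}

hash_table, rulebook = build_rulebook(coords)
out = submanifold_conv(features, rulebook, weights, n_out=1)
print(out)  # one output row per active input site
```

With the all-ones kernel, each adjacent voxel sums itself and its neighbor, while the isolated voxel sees only itself. The output has exactly as many rows as the input, which is the sparsity-preserving property the summary describes.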
Abstract
Convolutional networks are the de facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied to such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.
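To make the dilation effect noted in the TL;DR concrete, the following sketch counts how many sites become active when a regular 3x3x3 convolution is applied repeatedly to a sparse input, under the assumption that any site with at least one active neighbor in the kernel window becomes active. The helper `dilate` and the toy input (a line of 10 voxels) are illustrative assumptions, not taken from the paper. A submanifold sparse convolution would keep the count at 10 in every layer.

```python
from itertools import product


def dilate(active, kernel_size=3):
    """Active set after one regular convolution: every site whose
    kernel window touches an active site becomes active.
    (Hypothetical helper, named for this sketch only.)"""
    r = kernel_size // 2
    dim = len(next(iter(active)))
    offsets = list(product(range(-r, r + 1), repeat=dim))
    return {
        tuple(c + o for c, o in zip(site, off))
        for site in active
        for off in offsets
    }


# Toy sparse input: a straight line of 10 active voxels in 3D.
active = {(x, 0, 0) for x in range(10)}

regular = active
counts = [len(regular)]
for _ in range(3):  # three stacked regular 3x3x3 convolutions
    regular = dilate(regular)
    counts.append(len(regular))

print(counts)  # → [10, 108, 350, 784]
```

After three layers the regular convolution has to process 784 sites instead of 10, which gives a rough sense of why preserving the sparsity pattern matters for deep networks on sparse 3D data.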
