Optimizing Sparse Convolution on GPUs with CUDA for 3D Point Cloud Processing in Embedded Systems
Chester Luo, Kevin Lai
TL;DR
The paper tackles real-time sparse convolution for 3D point clouds on embedded GPUs by introducing a CUDA-based operator tailored to Jetson-class devices. It proposes a data layout and algorithmic framework built around a Location Table (LCT) and an Offset Table (OFT) that map sparse coordinates into efficient convolution computations, using coalesced input access and shared-memory caching of weights to minimize global memory traffic. A dedicated treatment of submanifold convolution, a two-stage approach for normal sparse convolution, and an inverse sparse convolution design together support both downsampling and upsampling on sparse data. The approach targets real-time 3D perception at the edge, offering portability beyond PyTorch-centric implementations while delivering substantial speedups on embedded CUDA platforms.
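To make the table-based mapping concrete, here is a minimal CPU sketch in plain Python of a single-channel submanifold convolution. It is not the paper's CUDA operator. The summary does not specify how the tables are laid out, so the sketch assumes the Location Table is a dictionary from a voxel coordinate to its feature row and the Offset Table is the list of relative kernel offsets. All function names are hypothetical.

```python
from itertools import product


def build_location_table(coords):
    """Assumed LCT: map each active voxel coordinate (x, y, z) to its feature row."""
    return {tuple(c): i for i, c in enumerate(coords)}


def build_offset_table(kernel_size=3):
    """Assumed OFT: relative (dx, dy, dz) offsets covered by a cubic kernel."""
    r = kernel_size // 2
    return list(product(range(-r, r + 1), repeat=3))


def submanifold_conv(coords, feats, weights, kernel_size=3):
    """Single-channel submanifold convolution.

    Outputs are produced only at active input sites, so sparsity is preserved.
    weights[k] is the scalar weight paired with offset_table[k].
    """
    lct = build_location_table(coords)
    oft = build_offset_table(kernel_size)
    out = [0.0] * len(coords)
    for i, (x, y, z) in enumerate(coords):
        for k, (dx, dy, dz) in enumerate(oft):
            # Look up the neighbor; inactive (empty) voxels contribute nothing.
            j = lct.get((x + dx, y + dy, z + dz))
            if j is not None:
                out[i] += weights[k] * feats[j]
    return out


# Three active voxels: two adjacent, one isolated.
coords = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
feats = [1.0, 2.0, 3.0]
weights = [1.0] * 27  # all-ones 3x3x3 kernel
out = submanifold_conv(coords, feats, weights)
print(out)  # [3.0, 3.0, 3.0]
```

With an all-ones kernel, each of the two adjacent voxels sums its own feature and its neighbor's, while the isolated voxel sees only itself. A GPU version would typically parallelize the outer loop over active sites, which is where coalesced feature reads and shared-memory weight caching become relevant.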
Abstract
In recent years, deep learning methods, particularly convolutional neural networks (CNNs), have become the dominant approach in domains that involve structured grid data, such as image analysis and processing. At the same time, the rapid growth in the use of LiDAR and 3D sensors across many domains has increased the need for analysis of 3D point clouds. 3D point clouds are crucial in applications such as object recognition and segmentation because they offer a spatial representation of objects in a three-dimensional environment. In contrast to images, point clouds are sparse and lack a regular grid, which poses distinct processing and computational challenges.
