Optimizing Sparse Convolution on GPUs with CUDA for 3D Point Cloud Processing in Embedded Systems

Chester Luo; Kevin Lai

TL;DR

The paper tackles the challenge of real-time sparse convolution for 3D point clouds on embedded GPUs by introducing a CUDA-based operator tailored for Jetson-class devices. It proposes a data-layout and algorithmic framework built around a Location Table (LCT) and Offset Table (OFT) to map sparse coordinates into efficient convolution computations, with coalesced input access and shared-memory caching of weights to minimize global memory traffic. A dedicated treatment for Submanifold convolution, along with a two-stage approach for normal sparse convolution and an inverse sparse convolution design, enables both downsampling and upsampling within sparse data regimes. The approach targets real-time, edge-enabled 3D perception applications, offering portability beyond PyTorch-centric implementations while delivering substantial speedups on embedded CUDA platforms.

Abstract

In recent years, deep learning methods, particularly convolutional neural networks (CNNs), have become the dominant approach in domains that involve structured grid data, such as image analysis and processing. However, the rapid growth in the use of LiDAR and 3D sensors across many domains has increased the need for 3D point cloud analysis. 3D point clouds are crucial in applications such as object recognition and segmentation, because they provide a spatial representation of objects in a three-dimensional environment. In contrast to images, point clouds are sparse and lack a regular grid, which poses distinct processing and computational challenges.

Paper Structure (20 sections, 7 equations, 10 figures, 2 algorithms)

Introduction
Background and Motivation
Structure and Representation for Point Cloud Data
Dense Convolution vs. Sparse Convolution
Related Work
Our Approach
Input and Output Structure
Sampling Indices Mapping
Position Identification
Start Cell Determination
Offsets Calculation
Position Mapping in the Convolved Space
Kernel Position Mapping
Submanifold Convolution
Create Location Table (LCT)
...and 5 more sections

Figures (10)

Figure 1: Voxelization of 3D points: (a) original TLS points; and (b) voxelized point cloud. Source: XU2021103675. The structured voxel grid offers a simplified and computationally efficient approach for 3D data analysis, suitable for applications in digital elevation modeling, urban planning, and 3D simulations.
Figure 2: Illustration of the data structures of a sparse tensor for voxel or point cloud data. On the left is the indices structure, depicting the point coordinates (m, x, y, z) for each batch ranging from 0 to M+1, where m is the batch ID. On the right is the features structure, showing how feature values are organized across channels, rows, and batches.
Figure 3: Illustrative comparison of a 2D dense matrix and a 2D sparse matrix. The dense matrix predominantly features non-zero elements, whereas the sparse matrix consists mainly of zero values with a few non-zero entries scattered throughout. Such distinctions highlight the storage and computational differences between the two matrix types.
Figure 4: Illustration of the convolution process using the 'im2col' approach (source: vasudevan2017parallel). Starting with a set of kernels $K$ and an input $I$, the method reshapes the input into overlapping patches. Correspondingly, kernels are reshaped into rows to form the kernel-patch-matrix $K'$. Matrix multiplication between $K'$ and the input-patch-matrix $I'$ yields the output-patch-matrix $O'$. The final output $O$ is derived by reshaping $O'$ to its intended dimensions.
Figure 5: Illustration of submanifold convolution on a 2D matrix. The input matrix $I$ is convolved with the kernel $K$. Only the central non-zero elements (highlighted in orange) undergo computation. Regions with a central zero (highlighted in red) are excluded from the convolution process.
...and 5 more figures
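
The submanifold convolution described in the Figure 5 caption can be illustrated with a minimal single-channel 2D sketch in Python. It is not the paper's CUDA operator and does not reproduce its Location Table or Offset Table. The function name `submanifold_conv2d` and the dictionary `index_of` are hypothetical, chosen only for illustration, and the kernel is applied without flipping, as is conventional in deep learning frameworks. Active sites are stored as a coordinate list with one feature value each, in the spirit of the indices and features layout in Figure 2.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col) of an active (non-zero) site


def submanifold_conv2d(
    coords: List[Coord],
    feats: List[float],
    kernel: List[List[float]],
) -> List[float]:
    """Single-channel 2D submanifold convolution over a coordinate list.

    Output is produced only at the input's active sites, so the sparsity
    pattern is preserved. Inactive neighbours contribute nothing.
    """
    k = len(kernel)
    assert k % 2 == 1 and all(len(row) == k for row in kernel), "odd square kernel expected"
    r = k // 2

    # Hypothetical lookup table (an assumption, not the paper's LCT):
    # maps a coordinate to its row index in `feats`.
    index_of: Dict[Coord, int] = {c: i for i, c in enumerate(coords)}

    out = [0.0] * len(coords)
    for i, (y, x) in enumerate(coords):
        acc = 0.0
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                j = index_of.get((y + dy, x + dx))
                if j is not None:  # skip inactive neighbours
                    acc += kernel[dy + r][dx + r] * feats[j]
        out[i] = acc
    return out


# Three active sites on a 3x3 grid; the centre (1, 1) is inactive.
coords = [(0, 0), (0, 1), (2, 2)]
feats = [1.0, 2.0, 4.0]
kernel = [[1.0] * 3 for _ in range(3)]  # 3x3 box filter
result = submanifold_conv2d(coords, feats, kernel)
print(result)  # [3.0, 3.0, 4.0]
```

A dense convolution with the same kernel would also produce a value (7.0) at the inactive centre (1, 1). The submanifold variant skips that position, so the output has exactly as many entries as the input has active sites.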
