Selectively Dilated Convolution for Accuracy-Preserving Sparse Pillar-based Embedded 3D Object Detection

Seongmin Park, Minjae Lee, Junwon Choi, Jungwook Choi

TL;DR

This paper tackles the inefficiency of dense pillar processing in real-time 3D object detection by introducing Selectively Dilated Convolution (SD-Conv), which uses pillar-level importance to dilate the convolution output selectively within each stage and thereby restores fine-grained spatial information flow (fSIF). A lightweight hardware augmentation, SPADE+, enables practical acceleration of SD-Conv on embedded sparse convolution accelerators without significant area or SRAM cost. Across multiple pillar-based detectors and benchmarks, SD-Conv reduces computation by up to 18.1x and speeds up inference by up to 16.2x on embedded accelerators while preserving or even improving accuracy. Together, SD-Conv and SPADE+ offer a practical way to exploit extreme pillar sparsity on resource-constrained hardware without sacrificing detection performance.

Abstract

Pillar-based 3D object detection has gained traction in self-driving technology due to its speed and accuracy, facilitated by the artificial densification of pillars for GPU-friendly processing. However, dense pillar processing fundamentally wastes computation since it ignores the inherent sparsity of pillars derived from scattered point cloud data. Motivated by recent embedded accelerators with native sparsity support, sparse pillar convolution methods like submanifold convolution (SubM-Conv) aimed to reduce these redundant computations by applying convolution only on active pillars, but suffered considerable accuracy loss. Our research identifies that this accuracy loss is due to the restricted fine-grained spatial information flow (fSIF) of SubM-Conv in sparse pillar networks. To overcome this restriction, we propose a selectively dilated convolution (SD-Conv) that evaluates the importance of encoded pillars and selectively dilates the convolution output, enhancing the receptive field for critical pillars and improving object detection accuracy. To facilitate actual acceleration with this novel convolution approach, we designed SPADE+ as a cost-efficient augmentation to existing embedded sparse convolution accelerators. This design supports SD-Conv without significant demands on area and SRAM size, realizing a superior trade-off between speedup and model accuracy. This strategic enhancement allows our method to achieve extreme pillar sparsity, leading to up to 18.1x computational savings and 16.2x speedup on embedded accelerators, without compromising object detection accuracy.
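
The difference between SubM-Conv and selective dilation can be illustrated on a toy 2D pillar grid. The sketch below is not the authors' implementation. It assumes that importance is the L2 norm of a pillar's feature vector, that the top_k most important pillars dilate into their full 3x3 neighborhood, and that all function names are hypothetical. It tracks only which output sites become active, not the convolution values themselves.

```python
from math import sqrt

# Offsets of a 3x3 kernel (center included) on the 2D pillar grid.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def neighborhood(site, height, width):
    """Return the in-bounds 3x3 neighborhood of a grid site."""
    y, x = site
    return {
        (y + dy, x + dx)
        for dy, dx in OFFSETS
        if 0 <= y + dy < height and 0 <= x + dx < width
    }


def subm_outputs(features):
    """SubM-Conv style: outputs exist only where input pillars are active."""
    return set(features)


def fully_dilated_outputs(features, height, width):
    """Every active pillar dilates, so the active set grows at each layer."""
    out = set()
    for site in features:
        out |= neighborhood(site, height, width)
    return out


def selectively_dilated_outputs(features, top_k, height, width):
    """Only the top_k most important pillars dilate (assumed rule)."""
    # Assumption: importance is the L2 norm of the pillar feature vector.
    importance = {s: sqrt(sum(v * v for v in f)) for s, f in features.items()}
    ranked = sorted(features, key=lambda s: (-importance[s], s))
    out = set(features)  # all active pillars keep their outputs
    for site in ranked[:top_k]:
        out |= neighborhood(site, height, width)
    return out


# Toy input: three active pillars on an 8x8 grid, one with a strong feature.
features = {(2, 2): [3.0, 4.0], (2, 3): [0.1, 0.1], (5, 5): [0.2, 0.0]}
print(len(subm_outputs(features)))                          # 3
print(len(selectively_dilated_outputs(features, 1, 8, 8)))  # 10
print(len(fully_dilated_outputs(features, 8, 8)))           # 21
```

In this toy case, selective dilation widens the receptive field around the one strong pillar while keeping the active set well below what full dilation produces. This is the kind of trade-off between sparsity and spatial information flow that the abstract describes.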

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The accuracy and computation trade-off of 3D object detection. The pillar-based baseline, PointPillars [lang2019pointpillars], delivers high accuracy but uses redundant computation. Sparse PointPillars [vedder2021sparse] employs SubM-Conv, reducing computation but losing considerable accuracy. Unlike FS-Conv [chen2022focalconv], which offers an inferior trade-off, our selectively dilated convolution (SD-Conv) retains accuracy while cutting computation by 18.1x, making it promising for embedded 3D object detection.
  • Figure 2: (a) Pillar-based 3D object detection network structure. (b) Feature extraction steps: Backbone, Neck, and Head. (c) Comparison of the receptive field of various sparse convolution operations within a stage: Dense-Conv, SubM/SPS-Conv, FS-Conv, and SD-Conv.
  • Figure 3: An overview of Selectively Dilated Convolution (SD-Conv).
  • Figure 4: (a) Training curves for Sparse PointPillars with SD-Conv at high sparsity, using magnitude-based versus trainable importance. (b) Car detection performance of Sparse PointPillars on the KITTI dataset for different methods of determining the dilation directions of SD-Conv.
  • Figure 5: Feature representation of a single car object based on sparse convolution type. "Input Pillars" represents the initial input of the backbone network, while the images corresponding to SubM-Conv, FS-Conv, and SD-Conv are the outputs of the last layer in Stage 2. The white rectangles indicate the boundaries of GT-boxes.
  • ...and 2 more figures