PDM-SSD: Single-Stage Three-Dimensional Object Detector With Point Dilation

Ao Liang, Haiyang Hua, Jian Fang, Wenyu Chen, Huaici Zhao

TL;DR

PDM-SSD tackles the limited receptive field of point-based 3D detectors by introducing a Point Dilation Mechanism that lifts sampled points onto a 2D grid and fills unoccupied space using angular and scale information derived from spherical harmonics and Gaussian densities. A PointNet-style backbone provides efficient per-point features, while the neck expands the feature space and a hybrid head jointly learns from dilated grid features and point-wise context. On KITTI, PDM-SSD achieves state-of-the-art performance among single-stage point-based detectors with fast inference (~68 FPS) and demonstrates robustness for sparse and incomplete objects, with auxiliary PDM further boosting accuracy without speed loss. The method balances accuracy and deployment practicality, offering a scalable approach to 3D detection in autonomous driving and related applications. The training loss is $L_{all}=L_{sample}+L_{p}+L_{heatmap}+L_2$, with $L_p=L_{vote}+L_{cls}+L_{reg}$ and $L_{reg}=L_{loc}+L_{size}+L_{angle-bin}+L_{angle-res}+L_{corner}$; $L_{sample}$ uses $Mask_i$ to emphasize central points, guiding robust learning for sparse targets.

Abstract

Current point-based detectors can only learn from the provided points, with limited receptive fields and insufficient global learning capabilities for sparse and incomplete targets. In this paper, we present a novel Point Dilation Mechanism for single-stage 3D detection (PDM-SSD) that takes advantage of both grid-based and point-based representations. Specifically, we first use a PointNet-style 3D backbone for efficient feature encoding. Then, a neck with the Point Dilation Mechanism (PDM) is used to expand the feature space, which involves two key steps: point dilation and feature filling. The former expands each sampled point to a grid of a certain size centered on that point in Euclidean space. The latter fills the unoccupied grid cells with features for backpropagation, using spherical harmonic coefficients for direction and a Gaussian density function for scale. Next, we associate multiple dilation centers and fuse coefficients to obtain sparse grid features through height compression. Finally, we design a hybrid detection head for joint learning: on one hand, the scene heatmap is predicted to complement the voting point set for improved detection accuracy, and on the other hand, the target probability of detected boxes is calibrated through feature fusion. On the challenging Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, PDM-SSD achieves state-of-the-art results for multi-class detection among single-modal methods with an inference speed of 68 frames per second. We also demonstrate the advantages of PDM-SSD in detecting sparse and incomplete objects through numerous object-level instances. Additionally, PDM can serve as an auxiliary network to establish a connection between sampling points and object centers, thereby improving the accuracy of the model without sacrificing inference speed. Our code will be available at https://github.com/AlanLiangC/PDM-SSD.git.

Paper Structure

This paper contains 20 sections, 20 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: (a) The basic structure of the grid-based 3D detector. P/VFE means Pillar/Voxel Feature Encoder. This approach obtains dense feature maps from sparse point clouds as input. (b) The basic structure of the point-based detector. Sparse point clouds are input and undergo multiple stages of downsampling, feature learning, and local feature aggregation to obtain sparser point-wise features. (c) The basic structure of our PDM-SSD. Point-wise features obtained from the point-based 3D backbone are lifted to the grid level through PDM. This joint learning approach helps alleviate the limited receptive field problem in (b).
  • Figure 2: The overall workflow of PDM-SSD. In the joint training phase, the input LiDAR point clouds are first passed through the embedding network to expand the feature space of the points. Then, a PointNet-style 3D backbone network is used to extract features for each point. This 3D backbone network consists of several stages of downsampling modules, local feature aggregation modules, and multi-scale feature aggregation modules. The neck network includes our proposed point dilation mechanism, where points are lifted to the grid level and feature filling is performed for the space left unoccupied by the original point cloud. The grid-wise features are then used to regress the heatmap of the scene, which provides information about the target's position, and the grid features are jointly learned with the fusion detection head to learn the global features of the target. In the auxiliary training phase, we do not use the information provided by the neck but only compute the prediction loss of the heatmap.
  • Figure 3: Visualization of some very sparse and extremely incomplete targets on the KITTI dataset. Grid-based backbone networks repeatedly apply padding, convolution, and pooling, so their features cover space that the original point cloud does not occupy. The receptive field expands continuously, which helps aggregate local features and combine features from different regions. Point-based methods can only extract features from existing points, and even if the number of surrounding points increases, the features remain unchanged. The receptive field is discontinuous and limited to local areas.
  • Figure 4: Point dilation operation. The point cloud is first projected onto a 2D binary occupancy grid and then dilated with a structuring element. The new feature map covers many areas that were not occupied by the original point cloud, especially the region where the target box is located (blue box). The feature at the center position is of great interest to the detector.
  • Figure 5: Feature filling operation. We propose a feature filling method based on spatial separation coefficients. We use point-wise feature learning for the Angle Coefficient and the Scale Coefficient. The former is obtained by the superposition of spherical harmonics, while the latter is obtained from a Gaussian probability density function. The new feature is the weighted sum of the dilated center feature and these two coefficients.
  • ...and 5 more figures
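
The point dilation and feature filling steps described in Figures 4 and 5 can be illustrated with a small sketch. This is not the authors' implementation. It assumes a sparse bird's-eye-view grid stored as a dictionary, a square structuring element, scalar features, and fixed (not learned) coefficients. The function names, the `radius`, `sigma`, and `sh_weights` parameters, and the choice to multiply the center feature by both coefficients and sum over overlapping centers are all hypothetical choices made for illustration.

```python
import math


def dilate_points(occupied, radius=1):
    """Dilate sparse 2D occupancy with a square structuring element.

    Returns a dict mapping every covered cell to the list of occupied
    (center) cells whose (2*radius+1) x (2*radius+1) window covers it.
    """
    covered = {}
    for (r, c) in occupied:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                covered.setdefault((r + dr, c + dc), []).append((r, c))
    return covered


def scale_coefficient(offset, sigma=1.0):
    """Unnormalized Gaussian falloff with distance from the dilation center."""
    d2 = offset[0] ** 2 + offset[1] ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def angle_coefficient(offset, sh_weights):
    """Weighted sum of real spherical harmonics of degree 0 and 1.

    The direction is the unit vector from center to cell with z = 0
    (bird's-eye view), so the z-dependent degree-1 term vanishes and is
    omitted. At the center itself only the constant term contributes.
    """
    norm = math.hypot(offset[0], offset[1])
    if norm == 0.0:
        x = y = 0.0
    else:
        x, y = offset[1] / norm, offset[0] / norm
    basis = (0.28209479, 0.48860251 * y, 0.48860251 * x)
    return sum(w * b for w, b in zip(sh_weights, basis))


def fill_features(occupied_feats, radius=1, sigma=1.0,
                  sh_weights=(1.0, 0.0, 0.0)):
    """Fill every dilated cell from the centers that cover it.

    Assumption: each center contributes
    feature * angle_coefficient * scale_coefficient, and contributions
    from several centers are summed.
    """
    covered = dilate_points(occupied_feats.keys(), radius)
    filled = {}
    for cell, centers in covered.items():
        total = 0.0
        for center in centers:
            offset = (cell[0] - center[0], cell[1] - center[1])
            weight = (angle_coefficient(offset, sh_weights)
                      * scale_coefficient(offset, sigma))
            total += weight * occupied_feats[center]
        filled[cell] = total
    return filled
```

Calling `fill_features({(2, 2): 1.0})` spreads one point's feature over its 3 x 3 neighborhood, with the largest value at the center. Non-zero degree-1 weights in `sh_weights` make the filled values depend on direction as well as distance.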