
NV3D: Leveraging Spatial Shape Through Normal Vector-based 3D Object Detection

Krittin Chaowakarn, Paramin Sangwongngam, Nang Htet Htet Aung, Chalie Charoenlarpnopparut

TL;DR

NV3D tackles the challenge of efficient 3D object detection from LiDAR by introducing normal-vector features computed from local voxel neighborhoods via a KNN+PCA pipeline. It proposes two sampling strategies—normal-vector density-based and FOV-aware bin-based sampling—and fuses these features with voxel descriptors through an element-wise attention mechanism, all atop a Voxel R-CNN backbone. On KITTI, NV3D yields consistent gains for cars and cyclists, and achieves substantial data reduction (up to $55\%$ fewer voxels) with minimal loss in accuracy. However, pedestrian detection remains harder, likely due to the non-flat geometry of humans, highlighting an area for future refinement in normal-vector estimation. Overall, the approach demonstrates that explicit surface-normal information and targeted sampling can boost local geometric reasoning while reducing computation in voxel-based 3D detection systems.
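The KNN+PCA step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `estimate_normals`, the neighbor count `k=8`, the brute-force neighbor search, and the choice to orient normals toward a sensor at the origin are all assumptions made for the example.

```python
import numpy as np

def estimate_normals(centers: np.ndarray, k: int = 8) -> np.ndarray:
    """Estimate one unit normal per voxel center with KNN + PCA.

    centers: (N, 3) voxel-center coordinates (hypothetical input layout).
    k: neighborhood size (assumed value, not taken from the paper).
    """
    n = centers.shape[0]
    k = min(k, n)
    # Brute-force pairwise squared distances; adequate for a small demo only.
    diff = centers[:, None, :] - centers[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Indices of the k nearest neighbors (each center counts itself).
    knn_idx = np.argsort(dist2, axis=1)[:, :k]
    neighbors = centers[knn_idx]                               # (N, k, 3)
    centered = neighbors - neighbors.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / k    # (N, 3, 3)
    # eigh sorts eigenvalues in ascending order, so column 0 is the
    # direction of least variance, i.e. the estimated surface normal.
    _, eigvecs = np.linalg.eigh(cov)
    normals = eigvecs[:, :, 0]
    # PCA leaves the sign ambiguous; assume a sensor at the origin and
    # flip normals that point away from it.
    flip = np.einsum("ni,ni->n", normals, centers) > 0
    normals[flip] *= -1.0
    return normals
```

For points sampled from a flat patch, the smallest-variance direction coincides with the plane normal, which is consistent with the observation above that flat-surfaced objects benefit more than pedestrians.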

Abstract

Recent studies in 3D object detection for autonomous vehicles aim to enrich features by using multi-modal setups or by extracting local patterns within LiDAR point clouds. However, multi-modal methods face significant challenges in feature alignment, and locally extracted features can be too simple for complex 3D object detection tasks. In this paper, we propose a novel model, NV3D, which uses local features acquired from voxel neighbors in the form of normal vectors, computed on a per-voxel basis using K-nearest neighbors (KNN) and principal component analysis (PCA). This informative feature enables NV3D to determine the relationship between the surface and pertinent target entities, including cars, pedestrians, and cyclists. During normal vector extraction, NV3D offers two distinct sampling strategies, normal vector density-based sampling and FOV-aware bin-based sampling, which allow up to 55% of the data to be eliminated while maintaining performance. In addition, we apply element-wise attention fusion, which takes voxel features as the query and value and normal vector features as the key, similar to the attention mechanism. Our method is trained on the KITTI dataset and demonstrates superior performance in car and cyclist detection owing to their spatial shapes. On the validation set, NV3D without sampling achieves 86.60% and 80.18% mean Average Precision (mAP) in car and cyclist detection, exceeding the baseline Voxel R-CNN by 2.61% and 4.23% mAP, respectively. With both sampling strategies, NV3D achieves 85.54% mAP in car detection, exceeding the baseline by 1.56% mAP, despite roughly 55% of voxels being filtered out.
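One possible reading of the element-wise attention fusion, with voxel features as query and value and normal vector features as key, is sketched below. The abstract does not give the exact formulation, so the linear projections `w_q`, `w_k`, `w_v`, the sigmoid gate, and the `sqrt(d)` scaling are assumptions chosen for illustration only.

```python
import numpy as np

def elementwise_attention_fusion(voxel_feat, normal_feat, w_q, w_k, w_v):
    """Fuse per-voxel features with per-voxel normal features.

    voxel_feat:  (N, C_v) voxel features, projected to query and value.
    normal_feat: (N, C_n) normal vector features, projected to key.
    w_q, w_v:    (C_v, D) projection matrices (hypothetical parameters).
    w_k:         (C_n, D) projection matrix (hypothetical parameter).
    """
    q = voxel_feat @ w_q       # query from voxel features
    k = normal_feat @ w_k      # key from normal vector features
    v = voxel_feat @ w_v       # value from voxel features
    d = q.shape[-1]
    # Element-wise (per-channel) score instead of a dot product over
    # channels; the sigmoid gate is an assumed choice of normalization.
    gate = 1.0 / (1.0 + np.exp(-(q * k) / np.sqrt(d)))
    return gate * v            # (N, D) fused feature

# Usage with random stand-in data.
rng = np.random.default_rng(0)
voxels = rng.normal(size=(4, 16))
normals = rng.normal(size=(4, 3))
fused = elementwise_attention_fusion(
    voxels, normals,
    rng.normal(size=(16, 8)), rng.normal(size=(3, 8)), rng.normal(size=(16, 8)),
)
```

Because the score is computed per channel for each voxel, no attention across voxels is involved, which keeps the cost linear in the number of voxels under these assumptions.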

Paper Structure

This paper contains 15 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The architecture of NV3D for 3D object detection. The LiDAR point clouds are first divided into voxels. The voxel features are used to compute normal vector features and sampling masks in the normal vector extraction module. The sampling masks are applied before the voxel and normal features are fed into element-wise attention fusion, which merges the two features. The merged feature is then passed to a Voxel R-CNN-based architecture to complete the 3D object detection model.
  • Figure 2: (a) Normal vectors directly extracted from points inside each voxel. (b) Normal vectors extracted from surrounding voxels.
  • Figure 3: Visualization of normal vector density-based sampling followed by FOV-aware bin-based sampling, showing the continuous density of voxel features.
  • Figure 4: Visualization of normal vectors for the same frame with varying drop rates, considering only normalized normal vector densities greater than 0.7: (a) original plot (no drop) (b) 50% drop rate (c) 100% drop rate.
  • Figure 5: (a) FOV-aware sampling: the $n^\text{th}$ bin contains $500 \times (2n - 1)$ points. (b) Visualization of the FOV-aware bin-based drop, showing more consistent voxel feature density.
  • ...and 6 more figures
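The per-bin count given in the Figure 5 caption, $500 \times (2n - 1)$ points for the $n^\text{th}$ bin, can be worked through with a short sketch. The function names and the `base` parameter are hypothetical, and how bins are laid out spatially is not stated in this listing, so only the arithmetic is shown.

```python
def bin_capacity(n: int, base: int = 500) -> int:
    """Point count for the n-th bin (1-indexed), per the Figure 5 caption."""
    if n < 1:
        raise ValueError("bins are 1-indexed")
    return base * (2 * n - 1)

def cumulative_capacity(n: int, base: int = 500) -> int:
    """Total points across bins 1..n; the odd-number sum gives base * n**2."""
    return sum(bin_capacity(i, base) for i in range(1, n + 1))

capacities = [bin_capacity(i) for i in range(1, 5)]
totals = [cumulative_capacity(i) for i in range(1, 5)]
print(capacities)  # [500, 1500, 2500, 3500]
print(totals)      # [500, 2000, 4500, 8000]
```

Since the odd numbers sum to a square, the total across the first $n$ bins is $500n^2$, so the count grows quadratically with the bin index.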