Hybrid Pooling and Convolutional Network for Improving Accuracy and Training Convergence Speed in Object Detection

Shiwen Zhao, Wei Wang, Junhui Hou, Hai Wu

TL;DR

The paper tackles the dual challenges of accuracy and training convergence in voxel-based 3D object detection. It introduces HPC-Net, a multimodal detector with three innovations: Replaceable Pooling (RP) for flexible 3D/2D pooling, Depth Accelerated Convergence Convolution (DACConv) to speed up training without sacrificing accuracy, and the Multi-Scale Extended Receptive Field Feature Extraction Module (MEFEM) to expand receptive fields and fuse multi-scale features for occluded/truncated objects. Evaluations on KITTI (and supplementary Waymo data) show state-of-the-art Car 2D results and competitive Car 3D performance, with substantial ablations confirming each component’s contribution to speed and accuracy. The approach, built on a Voxel-RCNN backbone with PENet virtual points, offers practical benefits for autonomous driving by reducing training time while delivering high-precision object detection in challenging scenarios.

Abstract

This paper introduces HPC-Net, a high-precision and rapidly convergent object detection network.

Paper Structure (12 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overall architecture of HPC-Net. (a) Replaceable Pooling. 3D Replaceable Pooling raises the voxel feature tensor by one dimension before pooling and then converts the tensor back to its initial input dimension. 2D Replaceable Pooling follows the same principle, except that it converts 2D images into 3D models and then converts them back. (b) Depth Accelerated Convergence Convolution. The DACConv kernel is used to convolve voxel feature tensors from different channels and feature maps. (c) Multi-Scale Extended Receptive Field Feature Extraction Module. It extracts 3D features of objects through multi-layer Deformable Convolution Alpher27 and RROI (replaceable region of interest) pooling, and then outputs the results through a multi-scale feature fusion network.
  • Figure 2: Replaceable Pooling. (a) Three steps (dimensionality increase, pooling, and dimensionality reduction) pool the input feature tensors in both 3D and 2D to improve accuracy and speed. (b) The previous method. By comparison, our method is more concise and efficient in structure.
  • Figure 3: Depth Accelerated Convergence Convolution. (a) The input tensor has the format $\mathit{C_{in}*H*W}$ and is reshaped to $\mathit{C_{in}*(H \times W)}$. (b) Multiply the channel convolution kernel $\mathit{(C_{in} * D * H * W)}$ by the feature map convolution kernel $\mathit{(C_{in} * C_{out} * H * W)}$ to generate a DACConv kernel. (c) Convolve the input tensor $\mathit{(C_{in} * (H \times W))}$ with the DACConv kernel $\mathit{(C_{in} * C_{out} * (H \times W))}$. $\mathit{C_{in}}$ indicates the index of the input channel. $\mathit{C_{out}}$ indicates the index of the output channel.
  • Figure 4: Multi-Scale Extended Receptive Field Feature Extraction Module. (a) Extending Area Convolution. The EAConv block consists of two layers of Dconv and RROI pooling, greatly increasing the receptive field area. Dconv denotes Deformable Convolution Alpher27. (b) Multi-scale Feature Fusion Network. It takes the output of EAConv as input and fuses features from multiple scales as its output.
  • Figure 5: Applying only Depth Accelerated Convergence Convolution. With Depth Accelerated Convergence Convolution, TED-M reaches convergence at the 15th cycle. Without it, TED-M reaches convergence at the 27th cycle.
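
The three steps named in the Replaceable Pooling captions (dimensionality increase, pooling, dimensionality reduction) can be illustrated with a minimal NumPy sketch of the 2D case. This is not the authors' implementation: the function names `max_pool3d` and `replaceable_pool_2d`, the choice of max pooling, the kernel size, and the non-overlapping stride are all assumptions made for illustration, since the captions do not specify them.

```python
import numpy as np


def max_pool3d(x, kernel):
    """Non-overlapping 3D max pooling over a (C, D, H, W) array.

    Assumption: stride equals the kernel size, and D, H, W are
    divisible by the corresponding kernel entries.
    """
    c, d, h, w = x.shape
    kd, kh, kw = kernel
    if d % kd or h % kh or w % kw:
        raise ValueError("spatial dims must be divisible by the kernel")
    # Split each spatial axis into (number of blocks, block size).
    blocks = x.reshape(c, d // kd, kd, h // kh, kh, w // kw, kw)
    # Take the maximum inside every block.
    return blocks.max(axis=(2, 4, 6))


def replaceable_pool_2d(feat, k=2):
    """Hypothetical 2D variant: lift, pool in 3D, then drop the added axis."""
    lifted = feat[:, np.newaxis, :, :]        # (C, H, W) -> (C, 1, H, W)
    pooled = max_pool3d(lifted, (1, k, k))    # pool with a depth-1 kernel
    return pooled[:, 0, :, :]                 # back to (C, H/k, W/k)


feat = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
out = replaceable_pool_2d(feat, k=2)
print(out.shape)
```

With a depth-1 kernel the result matches ordinary 2D max pooling, which is the point of the sketch: a single 3D pooling routine can serve both the 3D and the 2D case once the tensor is lifted and restored.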