Fine-Grained Pillar Feature Encoding Via Spatio-Temporal Virtual Grid for 3D Object Detection

Konyul Park, Yecheol Kim, Junho Koh, Byungwoo Park, Jun Won Choi

TL;DR

The paper addresses the accuracy gap in pillar-based LiDAR detectors by focusing on fine-grained within-pillar point distributions. It introduces FG-PFE, which uses Spatio-Temporal Virtual grids to encode vertical, temporal, and horizontal distributions via V-PFE, T-PFE, and H-PFE, fused by Attentive Pillar Aggregation. An auxiliary objectness score loss and a modified group head further improve training and discrimination. On nuScenes, FG-PFE delivers consistent gains over strong pillar-based baselines with minimal latency increase, supporting real-time deployment in autonomous systems.

Abstract

Developing high-performance, real-time architectures for LiDAR-based 3D object detectors is essential for the successful commercialization of autonomous vehicles. Pillar-based methods stand out as a practical choice for onboard deployment due to their computational efficiency. However, they can underperform alternative point encoding techniques such as voxel encoding or PointNet++. We argue that current pillar-based methods do not sufficiently capture the fine-grained distribution of LiDAR points within each pillar, which leaves considerable room for improvement in pillar feature encoding. In this paper, we introduce a novel pillar encoding architecture referred to as Fine-Grained Pillar Feature Encoding (FG-PFE). FG-PFE utilizes Spatio-Temporal Virtual (STV) grids to capture the distribution of point clouds within each pillar across the vertical, temporal, and horizontal dimensions. Through STV grids, points within each pillar are individually encoded using Vertical PFE (V-PFE), Temporal PFE (T-PFE), and Horizontal PFE (H-PFE). These encoded features are then aggregated through an Attentive Pillar Aggregation method. Our experiments on the nuScenes dataset demonstrate that FG-PFE achieves significant performance improvements over baseline models such as PointPillars, CenterPoint-Pillar, and PillarNet, with only a minor increase in computational overhead.

Paper Structure (20 sections, 6 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Performance versus latency of several 3D object detectors evaluated on the nuScenes val split. CP-Pillars denotes CenterPoint with a PointPillars backbone. Latency is measured on a single NVIDIA TITAN RTX GPU. When incorporated into baselines such as PointPillars, CenterPoint, and PillarNet, FG-PFE delivers substantial performance gains with small computational overhead.
  • Figure 2: Overall architecture of the proposed FG-PFE. LiDAR points are quantized along the vertical, temporal, and horizontal axes. In V-PFE, voxels along the vertical axis are aggregated by the vertical grid attention module. In T-PFE, voxels along the temporal axis are processed using a set of MLPs. In H-PFE, voxels from two different horizontal grids are transformed back into LiDAR points and then combined by concatenation. The LiDAR points are then converted into pillar features with the original pillar feature encoding. Pillar features originating from the three axes are combined using the Attentive Pillar Aggregation module.
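
The final step in Figure 2, combining the three per-axis pillar features, can be illustrated with a generic attention-weighted sum. The summary does not specify how Attentive Pillar Aggregation computes its weights, so the sketch below substitutes a plain softmax over one scalar score per branch; `attentive_aggregate` and `score_weights` are hypothetical names, and the scoring rule is an assumption rather than the paper's module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_aggregate(branch_features, score_weights):
    """Fuse per-branch pillar features (e.g. from V-PFE, T-PFE, H-PFE).

    branch_features: list of equal-length feature vectors, one per branch.
    score_weights:   hypothetical scoring vector standing in for learned
                     parameters; each branch score is a dot product with it.
    Returns (fused_feature, attention_weights).
    """
    # One scalar score per branch (assumed scoring rule).
    scores = [sum(w * f for w, f in zip(score_weights, feat))
              for feat in branch_features]
    weights = softmax(scores)
    dim = len(branch_features[0])
    # Weighted sum of the branch features, channel by channel.
    fused = [sum(weights[b] * branch_features[b][d]
                 for b in range(len(branch_features)))
             for d in range(dim)]
    return fused, weights


if __name__ == "__main__":
    v_feat, t_feat, h_feat = [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]
    fused, weights = attentive_aggregate([v_feat, t_feat, h_feat], [0.5, 0.5])
    print(fused, weights)
```

With an all-zero scoring vector every branch receives the same weight and the result reduces to a plain average, which gives a simple sanity check for the fusion step.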