BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

TL;DR

BEVSpread targets the position-approximation error in voxel pooling used by frustum-based BEV methods for vision-based roadside 3D object detection. It replaces single-grid accumulation with spread pooling to top-k neighboring BEV grids, guided by a depth-dependent Gaussian weight that links distance and depth through a learnable variance, and it achieves comparable inference time using CUDA acceleration. Across DAIR-V2X-I and Rope3D, BEVSpread yields substantial AP gains over BEVDepth and BEVHeight baselines, and also improves nuScenes performance, demonstrating strong plug-in capability and robustness under parameter perturbations. The work shows that depth-aware spread pooling can significantly enhance long-range and small-object detection in roadside perception, with practical impact for safer autonomous driving systems.

Abstract

Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it has inherent advantages in reducing blind spots and expanding perception range. Previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, while ignoring the position approximation error in the voxel pooling process. Motivated by this observation, we propose a novel voxel pooling strategy, dubbed BEVSpread, to reduce this error. Specifically, instead of assigning the image features contained in a frustum point to a single BEV grid, BEVSpread treats each frustum point as a source and spreads its image features to the surrounding BEV grids with adaptive weights. To achieve better propagation, a dedicated weight function dynamically controls the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves inference time comparable to that of the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread improves the performance of existing frustum-based BEV methods by (1.12, 5.26, 3.01) AP on vehicle, pedestrian, and cyclist, respectively.
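
The abstract describes the weight function only qualitatively: weights decay with distance, and the decay speed depends on depth. One form consistent with that description is sketched below. The symbols $d$ (distance from a frustum point to a BEV grid center), $D$ (depth of the point), and the linear parameterization with $\alpha$ and $\beta$ are assumptions made here for illustration; the paper's exact parameterization may differ.

$$
\omega(d, D) = \exp\!\left(-\frac{d^{2}}{2\,\sigma(D)^{2}}\right),
\qquad \sigma(D) = \alpha D + \beta, \quad \alpha, \beta > 0 .
$$

Under this assumed form, $\sigma$ grows with $D$, so for a fixed distance $d > 0$ the weight $\omega$ increases with depth: features of distant points decay more slowly and reach more of the surrounding grids, while $d = 0$ always gives $\omega = 1$.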

Paper Structure (24 sections, 8 equations, 10 figures, 11 tables, 1 algorithm)

This paper contains 24 sections, 8 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: The overall framework of BEVSpread. Spread voxel pooling consists of two main steps, neighbor selection and weight calculation. First, each 3D geometry point $p$ is mapped to BEV space, where the top-$k$ nearest BEV grid centers are selected as its neighbors $\Omega_{p,k}$. By contrast, the original voxel pooling selects only the top-1 nearest BEV grid center as its neighbor $\Omega_{p,1}$. Second, the weight function computes a weight for each neighbor: the weight $\omega_{p,\hat{p}}$ is a Gaussian function of the distance $d_{p,\hat{p}}$ with mean $0$ and variance $\sigma^2$. The variance $\sigma^2$ is positively related to the depth $D_p$ and controls the decay speed of $\omega_{p,\hat{p}}$. Finally, the image features contained in each 3D geometry point are accumulated onto its neighbors according to the calculated weights.
  • Figure 2: Effect of depth in voxel pooling. Image blocks of the same size at greater depth correspond to objects of larger 3D scale, so distant objects are covered by few image features. It is therefore reasonable to assign larger weights to the surrounding BEV grids for distant targets.
  • Figure 3: Visualization results of BEVHeight and our proposed BEVSpread in image and BEV view. The upper half shows that BEVSpread detects targets that BEVHeight misses in multiple scenes. The lower half shows why: BEVHeight misses the pedestrian because no corresponding image features are projected onto the correct BEV grids, whereas BEVSpread spreads the image features to the surrounding BEV grids and thus detects the target.
  • Figure 4: Proof experiment for position recovery. Spread voxel pooling recovers the random point position with an MSE loss of 0.003 when the number of neighbors $k \geq 3$, while the original voxel pooling ($k=1$) yields an MSE loss of 0.095.
  • Figure 5: Hyperparameter sensitivity experiment on the number of neighbors $k$. Performance with $k \geq 2$ is significantly better than with $k=1$ (baseline). As $k$ increases, performance gradually improves and then stabilizes.
  • ...and 5 more figures
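
The two steps named in the Figure 1 caption, neighbor selection and weight calculation, can be sketched in NumPy as follows. This is a minimal illustration, not the authors' CUDA implementation. The function name `spread_voxel_pooling`, the brute-force neighbor search, the unnormalized weights, and the linear relation `sigma = alpha * depth + beta` (with hypothetical parameters `alpha` and `beta`) are all assumptions made here; the caption only states that the variance is positively related to depth.

```python
import numpy as np


def spread_voxel_pooling(points_xy, depths, feats, grid_shape,
                         cell=1.0, k=4, alpha=0.05, beta=0.5):
    """Accumulate point features onto the k nearest BEV grid centers.

    points_xy  : (N, 2) BEV (x, y) coordinates of the frustum points
    depths     : (N,)   depth D_p of each point
    feats      : (N, C) image features carried by each point
    grid_shape : (H, W) number of BEV cells along y and x
    alpha, beta: hypothetical parameters, sigma = alpha * depth + beta
    """
    H, W = grid_shape
    C = feats.shape[1]

    # Centers of all BEV cells, flattened row-major to shape (H*W, 2).
    xs = (np.arange(W) + 0.5) * cell
    ys = (np.arange(H) + 0.5) * cell
    gx, gy = np.meshgrid(xs, ys)                      # each (H, W)
    centers = np.stack([gx.ravel(), gy.ravel()], axis=1)

    # Step 1: neighbor selection. Brute-force distances (N, H*W),
    # then the indices of the k smallest distances per point.
    diff = points_xy[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nbr = np.argpartition(dist, k - 1, axis=1)[:, :k]  # (N, k)
    d = np.take_along_axis(dist, nbr, axis=1)          # (N, k)

    # Step 2: weight calculation. Gaussian in distance; the assumed
    # sigma grows with depth, so distant points decay more slowly.
    sigma = alpha * depths + beta                      # (N,)
    w = np.exp(-d ** 2 / (2.0 * sigma[:, None] ** 2))  # (N, k)

    # Accumulate weighted features; np.add.at sums correctly when
    # several points write to the same grid cell.
    bev = np.zeros((H * W, C), dtype=float)
    contrib = (w[:, :, None] * feats[:, None, :]).reshape(-1, C)
    np.add.at(bev, nbr.ravel(), contrib)
    return bev.reshape(H, W, C)
```

With `k=1` the sketch writes to the same single grid cell that the original voxel pooling would select, although it still scales the feature by the Gaussian weight. The dense distance matrix costs O(N·H·W) memory, which is acceptable for a toy example but is exactly the kind of cost a dedicated parallel kernel would avoid.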