PointBeV: A Sparse Approach to BeV Predictions

Loick Chambon; Eloi Zablocki; Mickael Chen; Florent Bartoccioni; Patrick Perez; Matthieu Cord

TL;DR

The paper tackles the inefficiency of dense Bird's-eye View (BeV) representations in camera-based autonomous driving, especially for long temporal contexts. It introduces PointBeV, a sparse BeV segmentation framework that projects sparse BeV points, uses Sparse Feature Pulling to extract camera features only where visible, and employs Submanifold Attention for efficient temporal aggregation. A two-pass coarse/fine training strategy concentrates computation on discriminative regions, and inference supports flexible trade-offs with test-time priors like LiDAR and HD maps. On nuScenes and Lyft L5, PointBeV achieves state-of-the-art IoU for vehicle, pedestrian, and lane segmentation in both static and temporal settings, while offering substantial memory and compute savings. This sparse paradigm enables longer temporal horizons on memory-constrained hardware and paves the way for broader sparse-BeV applications in perception and planning.

Abstract

Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.
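As a rough illustration of the Sparse Feature Pulling idea named in the abstract, the sketch below pulls features only for points that a camera actually sees and skips interpolation otherwise. It is not the released implementation: the pinhole projection, the single-channel feature maps, the per-camera point coordinates, the averaging across cameras, and all names (`project`, `bilinear`, `sparse_feature_pull`, the `cam` dictionary keys) are assumptions made for the example.

```python
import math

def project(point, cam):
    """Pinhole projection of a 3D point given in the camera frame (z forward).
    Returns (u, v) in feature-map pixels, or None if the point is not visible."""
    x, y, z = point
    if z <= 0.0:  # behind the camera
        return None
    u = cam["fx"] * x / z + cam["cx"]
    v = cam["fy"] * y / z + cam["cy"]
    if 0.0 <= u <= cam["width"] - 1 and 0.0 <= v <= cam["height"] - 1:
        return u, v
    return None  # outside the feature map

def bilinear(fmap, u, v):
    """Bilinearly interpolate a single-channel map stored as fmap[row][col]."""
    h, w = len(fmap), len(fmap[0])
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * fmap[v0][u0] + du * fmap[v0][u1]
    bottom = (1 - du) * fmap[v1][u0] + du * fmap[v1][u1]
    return (1 - dv) * top + dv * bottom

def sparse_feature_pull(points_per_cam, cams, fmaps):
    """points_per_cam[c][i] is BeV point i expressed in the frame of camera c.
    Only visible (point, camera) pairs are interpolated; features from several
    cameras are averaged (an assumed fusion rule), unseen points get 0.0."""
    n = len(points_per_cam[0])
    sums, counts = [0.0] * n, [0] * n
    for pts, cam, fmap in zip(points_per_cam, cams, fmaps):
        for i, p in enumerate(pts):
            uv = project(p, cam)
            if uv is None:
                continue  # no work is spent on points this camera cannot see
            sums[i] += bilinear(fmap, *uv)
            counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

In this toy version the saving comes from the `continue` branch: the cost scales with the number of visible (point, camera) pairs rather than with points times cameras.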

Paper Structure (27 sections, 1 equation, 13 figures, 17 tables)

Introduction
Related Work
Vision-based BeV Segmentation.
PointBeV
Sparse Feature Propagation
Coarse and fine training
Sparse temporal model
Inference with PointBeV
Experiments
Data, training and implementation details.
State-of-the-art comparison
Ablations
Adaptive Inference Capabilities
Conclusion
Technical Details
...and 12 more sections

Figures (13)

Figure 1: BeV vehicle IoU vs. memory footprint on the nuScenes (Caesar et al., 2020) validation set. Models are evaluated without visibility filtering (i.e., all annotated vehicles are considered) at resolution $224 \times 480$. Memory consumption is measured on a 40GB A100 GPU. The size of a dot represents the number of BeV points being evaluated; smaller is better. PointBeV can explore various trade-offs between efficiency and performance by varying the number of points considered; the remaining points are set to zero in the final prediction. With PointBeV, state-of-the-art performance is reached using only a small portion of the points.
Figure 2: PointBeV architecture. As a sparse method, PointBeV is trained using local predictions, only for sampled 2D points provided as inputs. The selection of those points during training and at test time is illustrated in Figure 4. The points of interest are lifted to form 3D pillars, with each 3D point pulling visual features. To achieve this, PointBeV incorporates an efficient feature extraction process through a Sparse Feature Pulling module, illustrated in the 'efficient feature extraction' block and further explained in the 'Sparse Feature Propagation' section and Figure 3. The obtained 3D BeV features are then flattened onto the 2D BeV plane and processed using a sparse U-Net with task-dependent final heads, generating local BeV predictions. For training, we only need sparse signals. At test time, points that have not been sampled are set to zero.
Figure 3: Sparse Feature Pulling and Camera Fusion. 3D BeV points are projected into the localized camera features (left). From there, camera features are bilinearly interpolated to obtain the 3D BeV features at this position (right). Where previous methods project points onto all the cameras regardless of their visibility, or pad the number of points so that there are as many per camera, we conduct feature pulling, for each camera, only on the visible 3D points. If a point is visible to a single camera, the feature pulling is done only within the corresponding feature volume.
Figure 4: Illustration of the 'coarse' and 'fine' passes. Top row: given sampled BeV points, predictions are made at these locations in the 'coarse pass'. The points with the highest logits are selected as 'anchors'. Around these anchors, points are densely sampled using a kernel of size $k_\textit{fine} \times k_\textit{fine}$ ($3\times3$ in this visualization). Then the 'fine pass' provides predictions for these points. The networks (Figure 2) are shared between passes, and the camera feature extraction is only done once as the features don't change. This figure illustrates both the training and the inference stages, and we stress the differences between the two that are not visible in the figure. During training, (1) the coarse points are typically randomly sampled from a uniform distribution, and (2) the top $N_\textit{anchor}$ activations are selected as anchors. During inference, (1) the coarse points are sampled using different strategies such as the subsampled pattern (see the 'Inference with PointBeV' section), and (2) points having a score above the threshold $\tau$ are selected as anchors. To evaluate the entire dense BeV, we instead make a single pass with all BeV points. The bottom row displays sampling masks for three different sampling strategies, with the ground-truth vehicles' bounding boxes delineated in black for visualization.
Figure 5: Illustration of the 'Submanifold Temporal Attention' module. Our module performs attention between a query point (colored in red) and the points inside the spatio-temporal neighborhood centered on it (red dotted lines and complete parallelepiped). The points inside this neighborhood become the keys and values for the attention mechanism. The points outside are discarded. Consequently, the number of keys and values depends on the number of points present in the vicinity of the query point. More details in the 'Sparse temporal model' section.
...and 8 more figures
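
The anchor selection and densification described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the dictionary-of-logits input, the odd kernel size, and the choice to keep the anchor itself among the fine points are all hypothetical.

```python
def select_anchors(coarse_logits, tau=None, n_anchor=None):
    """coarse_logits maps a BeV cell (row, col) to its coarse-pass logit.
    With tau set (inference-style), keep cells scoring above the threshold;
    otherwise (training-style), keep the n_anchor highest-scoring cells."""
    if tau is not None:
        return [cell for cell, score in coarse_logits.items() if score > tau]
    ranked = sorted(coarse_logits, key=coarse_logits.get, reverse=True)
    return ranked[:n_anchor]

def fine_points(anchors, k_fine, grid_h, grid_w):
    """Densely sample a k_fine x k_fine window around each anchor,
    clipped to the BeV grid. Assumes k_fine is odd."""
    radius = k_fine // 2
    cells = set()
    for row, col in anchors:
        for d_row in range(-radius, radius + 1):
            for d_col in range(-radius, radius + 1):
                r, c = row + d_row, col + d_col
                if 0 <= r < grid_h and 0 <= c < grid_w:
                    cells.add((r, c))
    return cells

# Example: three coarse points on a 10 x 10 grid.
logits = {(0, 0): 2.0, (5, 5): -1.0, (9, 9): 0.5}
anchors = select_anchors(logits, tau=0.0)
fine = fine_points(anchors, k_fine=3, grid_h=10, grid_w=10)
```

The set returned by `fine_points` is what a second, 'fine' pass would evaluate; because overlapping windows are merged in a set, no cell is evaluated twice within that pass.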
