U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization

Andrea Boscolo Camiletto; Alfredo Bochicchio; Alexander Liniger; Dengxin Dai; Abel Gawel

TL;DR

The paper tackles reliable image-based relocalization under GPS-denied conditions by introducing U-BEV, a height-aware BEV representation that reasons over multiple height layers before flattening the BEV features. It couples a lightweight, multi-height encoder–decoder BEV model with a neural encoding of SD-map data and a differentiable template matcher for end-to-end relocalization. On nuScenes, U-BEV gains approximately $1.7$ to $2.8$ mIoU over transformer-based BEV methods of similar computational complexity and improves Recall Accuracy at a $10\,\mathrm{m}$ threshold by about $26.4\%$, while maintaining real-time inference with reduced computational load. The approach enables robust relocalization in feature-poor or degenerate environments by leveraging road-shape structure rather than distinct landmarks, with practical applicability to lightweight autonomous driving systems.
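The two metrics quoted above can be made concrete with a minimal sketch. It assumes that IoU is computed per class on boolean BEV masks and averaged into mIoU, and that Recall Accuracy is the fraction of samples whose planar position error falls within the threshold; the summary does not spell out either definition, and the function names `iou`, `mean_iou`, and `recall_at` are hypothetical.

```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two boolean BEV masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return float("nan")  # class absent from both masks
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(pred_labels, target_labels, num_classes):
    """Average of per-class IoU, skipping classes absent from both maps."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    scores = [iou(pred_labels == c, target_labels == c)
              for c in range(num_classes)]
    return float(np.nanmean(scores))

def recall_at(pred_xy, true_xy, threshold_m=10.0):
    """Fraction of samples with planar error <= threshold_m (assumed definition)."""
    diff = np.asarray(pred_xy, dtype=float) - np.asarray(true_xy, dtype=float)
    err = np.linalg.norm(diff, axis=1)
    return float(np.mean(err <= threshold_m))
```

These functions return fractions in the range 0 to 1. The figures quoted above appear to be in percentage points, so a result would need to be multiplied by 100 before comparison.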

Abstract

Efficient relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance and, in turn, can benefit the relocalization of the vehicle. However, one downside of BEV methods is the heavy computation required to leverage the geometric constraints. This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features. We show that this extension boosts the performance of U-BEV by up to 4.11 IoU. Additionally, we combine the encoded neural BEV with a differentiable template matcher to perform relocalization on neural SD-map data. The model is fully end-to-end trainable and outperforms transformer-based BEV methods of similar computational complexity by 1.7 to 2.8 mIoU and BEV-based relocalization by over 26% Recall Accuracy on the nuScenes dataset.
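To illustrate what a differentiable template matcher can look like, the sketch below scores every offset of a BEV feature template inside a larger map feature grid and then takes a softmax-weighted expected offset (a soft-argmax). This is a deliberately simplified stand-in and not the matcher used in the paper: the plain dot-product score, the `temperature` parameter, and the names `correlation_map` and `soft_argmax` are all assumptions made for illustration.

```python
import numpy as np

def correlation_map(neural_map, neural_bev):
    """Score each placement of the BEV template on the map by dot product.

    neural_map: (C, H, W) feature grid; neural_bev: (C, h, w) template.
    Returns an (H - h + 1, W - w + 1) array of scores.
    """
    _, H, W = neural_map.shape
    _, h, w = neural_bev.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = neural_map[:, i:i + h, j:j + w]
            scores[i, j] = np.sum(window * neural_bev)
    return scores

def soft_argmax(scores, temperature=1.0):
    """Softmax over all offsets, then the expected (row, col) offset.

    Unlike a hard argmax, the expectation varies smoothly with the scores,
    which is what lets gradients flow in an autodiff framework.
    """
    logits = scores / temperature
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    rows, cols = np.indices(scores.shape)
    return float((probs * rows).sum()), float((probs * cols).sum()), probs
```

A lower `temperature` concentrates the distribution on the best-scoring offset, while a higher one averages over several candidate locations. In a trainable pipeline the same operations would be written in an autodiff framework rather than NumPy.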

Paper Structure (14 sections, 3 equations, 6 figures, 2 tables)


INTRODUCTION
RELATED WORK
Bird's-Eye-View Segmentation
Image-based Relocalization
METHOD
Bird's Eye View Reconstruction
Map encoding
Localization
EXPERIMENTS
Dataset
Experimental Setup
Training
Results
CONCLUSIONS

Figures (6)

Figure 1: U-BEV proposes a novel BEV representation from surround-view images for efficient neural relocalization in SD map data.
Figure 2: Overview of the U-BEV Neural Relocalization Model. U-BEV predicts the local BEV from a set of surround cameras. A pretrained encoder extracts features from it, yielding the Neural BEV (left). The map encoder extracts features from the global SD map, cropped around the location prior $\xi_{init}$ (right), to build the neural map representation. The deep template matching module (QATM) computes the best matching location (center).
Figure 3: Distribution of LiDAR readings reprojected onto the image planes, expressed as height above the ground in the car frame and as distance from the camera, on nuScenes.
Figure 4: Architecture of the U-BEV Model. (a) The pretrained backbone (in blue) extracts features from all 6 cameras around the car. The first decoder (in orange) predicts the height of each pixel on each input image. This height is used to project features from different cameras into a single BEV (in green). Deeper features get projected to lower-resolution BEVs and are then upsampled in an encoder-decoder fashion with skip connections (in yellow). (b) Illustrates the projection operation from surround view images and heights to different BEV layers.
Figure 5: Sample inputs and outputs of the proposed U-BEV, including surround images, predicted heights, and predicted and ground-truth BEV. Compared to CVT, U-BEV reconstructs drivable surfaces and sidewalks more faithfully.
...and 1 more figure
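The projection step described in the Figure 4 caption can be sketched geometrically: once a pixel's height above the ground is known, its viewing ray can be intersected with the horizontal plane at that height to obtain a position in the car frame, and from there a BEV cell. The sketch below is an illustration under stated assumptions and not the paper's implementation. It assumes a pinhole camera, a car frame with x forward, y left, and z up, one scalar height per pixel, and a hypothetical grid of `bev_size` cells at `metres_per_cell` resolution centred on the car. The function name `pixel_to_bev` is also hypothetical.

```python
import numpy as np

def pixel_to_bev(u, v, height, K, R_cam_to_car, t_cam_in_car,
                 bev_size=200, metres_per_cell=0.5):
    """Lift pixel (u, v) with a known height above ground into the car frame.

    Returns (point_xyz, (row, col)), or None if the ray never reaches that
    height in front of the camera or the point falls outside the grid.
    """
    t = np.asarray(t_cam_in_car, dtype=float)
    # Back-project the pixel to a ray direction in the camera frame.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate the ray into the car frame.
    ray_car = R_cam_to_car @ ray_cam
    if abs(ray_car[2]) < 1e-9:
        return None  # ray is parallel to the height plane
    # Solve t_z + scale * ray_z = height for the distance along the ray.
    scale = (height - t[2]) / ray_car[2]
    if scale <= 0:
        return None  # intersection lies behind the camera
    point = t + scale * ray_car
    # Assumed grid layout: car at the centre, forward is up, left is left.
    row = int(np.floor(bev_size / 2 - point[0] / metres_per_cell))
    col = int(np.floor(bev_size / 2 - point[1] / metres_per_cell))
    if not (0 <= row < bev_size and 0 <= col < bev_size):
        return None
    return point, (row, col)
```

Features from all cameras that land in the same cell would then be aggregated. Repeating the projection at coarser grid resolutions for deeper feature maps would give the multi-resolution BEVs that the caption describes.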
