BEVRender: Vision-based Cross-view Vehicle Registration in Off-road GNSS-denied Environment

Lihong Jin, Wei Dong, Wenshan Wang, Michael Kaess

TL;DR

BEVRender addresses GNSS-denied off-road vehicle localization, where the lack of distinct landmarks and GNSS outages hinder vision-based approaches. It introduces a learning-based pipeline that synthesizes local bird's-eye-view (BEV) images from multi-view camera data using a BEVFormer-inspired feature encoder with deformable attention, followed by a CNN-based BEV rendering head. These local BEV images are registered to a geo-referenced aerial map via normalized cross-correlation (NCC) template matching to achieve accurate 2D localization while reducing the storage burden typical of image-retrieval methods. Real-world experiments on Pittsburgh data show improved localization accuracy and update frequency, and ablations and cross-sequence tests demonstrate robustness and generalization to unseen trajectories. The approach offers a practical, scalable solution for online localization in off-road environments where GPS is unavailable or unreliable.

Abstract

We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality bird's-eye-view (BEV) images of the local terrain. Subsequently, these images are aligned with a geo-referenced aerial map through template matching to achieve accurate cross-view registration. Our approach overcomes the inherent limitations of visual-inertial odometry systems and the substantial storage requirements of image-retrieval localization strategies, which are susceptible to drift and scalability issues, respectively. Extensive experimentation validates BEVRender's advancement over existing GNSS-denied visual localization methods, demonstrating notable enhancements in both localization accuracy and update frequency.
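
The registration step described here, matching a rendered BEV image against an aerial map, can be illustrated with a minimal sketch of normalized cross-correlation (NCC) template matching. This is not the authors' implementation: the function name `ncc_match`, the single-channel inputs, and the brute-force search over every offset are assumptions made for illustration only.

```python
import numpy as np


def ncc_match(search: np.ndarray, template: np.ndarray):
    """Slide `template` over `search` and score each offset with zero-mean NCC.

    Both inputs are 2D float arrays (hypothetical single-channel images).
    Returns the (row, col) top-left offset with the highest score and the
    full score map, whose values lie in [-1, 1].
    """
    th, tw = template.shape
    sh, sw = search.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())

    # Flat windows (zero variance) keep the minimum score of -1.
    scores = np.full((sh - th + 1, sw - tw + 1), -1.0)
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            window = search[r:r + th, c:c + tw]
            w = window - window.mean()
            denom = t_norm * np.sqrt((w ** 2).sum())
            if denom > 0:
                scores[r, c] = (t * w).sum() / denom

    best = np.unravel_index(np.argmax(scores), scores.shape)
    return (int(best[0]), int(best[1])), scores


if __name__ == "__main__":
    # Synthetic stand-ins: a random "aerial" search region and a crop of it
    # playing the role of the rendered BEV image.
    rng = np.random.default_rng(0)
    aerial = rng.random((40, 40))
    bev = aerial[12:20, 25:33].copy()
    offset, score_map = ncc_match(aerial, bev)
    print("best offset:", offset, "score:", score_map[offset])
```

Because NCC subtracts the mean and divides by the norm of each window, the score is insensitive to uniform brightness and contrast changes between the two images. The peak offset, combined with the map's geo-referencing, would give the 2D position estimate.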

Paper Structure (16 sections, 11 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: A diagram of our system. The light blue background indicates the training phase and the light green background indicates the testing phase. During the training phase, camera images are patch projected and sent to the feature encoder (in blue) and rendering head (in orange) to generate BEV images (highlighted in yellow boxes). The aerial map image is rotated and cropped according to the GPS information provided, ensuring that the final label image accurately represents the BEV space surrounding the vehicle. During the testing phase, the rendered BEV image is rotated according to the azimuth angle provided by the GPS, and matched against a local search region surrounding the vehicle position.
  • Figure 2: Encoder layer architecture. An encoder layer is composed of temporal and spatial attention. In temporal attention, a set of 2D reference points with a spatial dimension of $l\times w$ is sampled and deformed. Next, bilinear sampling is performed to extract tokens for multi-head attention (MHA; Vaswani et al., 2017), given the deformed reference points, from the previous-timestamp BEV feature $B_{t-1}$. The MHA output from temporal attention serves as the query for the subsequent spatial attention module. In spatial attention, we sample one point per 3D grid cell in the BEV space as reference points and project them to the three camera image frames with the extrinsic and intrinsic parameters to obtain 2D reference points for each image view. As in temporal attention, the 2D reference points are deformed and used for bilinear sampling, but from the camera features. A more detailed description can be found in the encoder layer description section.
  • Figure 3: Temporal feature propagation and dataset organization. For each timestamp, we sample $n$ frames from the past $T$ seconds, composing a training sample of $n\!+\!1$ camera frames together with the current-timestamp frame. Starting with the earliest timestamp in the window, the BEV query $Q$ is used to query the camera feature $F$ to obtain the BEV feature $B$, which is subsequently projected to the vehicle position at the next timestamp, given the GPS outputs, to obtain a new feature $B'$. Propagation continues until the latest frame is processed. A detailed description of the projection can be found in the encoder layer description section.
  • Figure 4: Qualitative comparison of our method and Litman. Top row: the rendering and registration results of our method, where the BEV images are highlighted in yellow boxes, the red dots indicate the NCC predictions from our system, and the blue dots indicate the GPS ground-truth positions. Our approach produces renderings that are coherent with the aerial image. Bottom row: predictions from Litman. As above, the red and blue dots indicate the predictions and ground truth, while the yellow boxes indicate the generated occupancy image overlaid on the ground truth. Only semi-dense renderings are available for Litman (see the saturated white and green points around the red dots), resulting in compromised registration accuracy.
  • Figure 5: Trajectory plot for cross-sequence testing. Sequences 3 and 8 are used for training; sequences 4 to 7 are used for testing.
  • ...and 1 more figures
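
The spatial-attention sampling step described in the Figure 2 caption (project 3D reference points into a camera image, then bilinearly sample features there) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the function names, the 4x4 transform `T_cam_from_vehicle`, and the pinhole intrinsics `K` are hypothetical, and the learned deformation offsets and the attention weighting are omitted.

```python
import numpy as np


def project_points(points_vehicle, T_cam_from_vehicle, K):
    """Project Nx3 vehicle-frame points into an image with a pinhole model.

    T_cam_from_vehicle: hypothetical 4x4 extrinsic transform (vehicle -> camera).
    K: 3x3 intrinsic matrix. Returns (uv, valid), where uv is Nx2 pixel
    coordinates (u = column, v = row) and valid marks points in front of
    the camera.
    """
    n = points_vehicle.shape[0]
    homo = np.hstack([points_vehicle, np.ones((n, 1))])
    cam = (T_cam_from_vehicle @ homo.T).T[:, :3]
    valid = cam[:, 2] > 1e-6
    depth = np.where(valid, cam[:, 2], 1.0)  # avoid dividing by zero
    pix = (K @ cam.T).T
    uv = pix[:, :2] / depth[:, None]
    return uv, valid


def bilinear_sample(feat, uv):
    """Bilinearly sample an HxWxC feature map at float pixel coordinates.

    Coordinates outside the image are clamped to the border here; a fuller
    implementation would more likely mask those samples out.
    """
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)  # assumption: camera frame coincides with vehicle frame
    pts = np.array([[0.1, 0.2, 2.0], [0.0, 0.0, -1.0]])
    uv, valid = project_points(pts, T, K)
    feat = np.random.default_rng(0).random((80, 100, 8))
    tokens = bilinear_sample(feat, uv[valid])
    print(uv, valid, tokens.shape)
```

In a deformable-attention layer, learned offsets would be added to `uv` before sampling, and the sampled tokens from each camera view would then be weighted and aggregated for each BEV query.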

Theorems & Definitions (2)

  • Remark 1: Testing with Litman
  • Remark 2: Testing with GeoDTR