DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising

Li Gao, Hongyang Sun, Liu Liu, Yunhao Li, Yang Cai

TL;DR

DiffVL redefines visual localization by treating noisy GPS data as a denoising target conditioned on BEV visual features and SD-map priors. It jointly learns trajectory refinement via diffusion and geometric consistency via BEV-map alignment, enabling sub-meter pose accuracy without HD maps. The method demonstrates state-of-the-art performance on KITTI, MGL, and nuScenes, indicating strong generalization across urban environments and camera types. This work suggests diffusion models can unlock scalable, GPS-informed localization that reduces reliance on costly HD maps, with broad implications for autonomous driving and robotics.

Abstract

Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps such as OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that a noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encodes the true pose distribution, which can be recovered through iterative diffusion refinement. Unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work shows that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift from traditional matching-based methods.
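
For readers unfamiliar with the "GPS denoising" framing, the generic conditional DDPM formulation is sketched below. It is the standard textbook form, not the paper's own equations (those are not part of this excerpt). The symbols are assumptions introduced here: $x_0$ for the clean pose trajectory, $x_t$ for its corrupted version at step $t$, $c$ for the conditioning features fused from the BEV and SD-map encoders, and $\epsilon_\theta$ for the learned noise predictor.

```latex
% Forward (corruption) process with noise schedule \beta_t,
% \alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{s \le t} \alpha_s:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)

% Learned reverse (denoising) step, conditioned on c:
p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\right),
\qquad
\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}}
  \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t, c)\right)

% Usual noise-prediction training objective:
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
  \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ c\right) \right\|^2
```

Under this reading, localization amounts to running the reverse step repeatedly, starting from the noisy GPS trajectory rather than from pure Gaussian noise. Whether DiffVL uses exactly this parameterization, schedule, or loss is not stated in the abstract.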

Paper Structure

This paper contains 19 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed DiffVL. Most existing SD map-based localization methods rely on exhaustive geometric matching between Bird's-Eye View (BEV) features and map elements to compute the pose. In contrast, our approach fundamentally reformulates visual localization as a generative modeling task.
  • Figure 2: Architecture of DiffVL. As the first visual localization framework built upon diffusion models, our system pioneers a paradigm shift from traditional matching-based approaches to a generative formulation. The architecture accepts three critical inputs: (i) a monocular front-view RGB image capturing immediate scene context, (ii) standard-definition (SD) map data providing structural priors, and (iii) a noisy GPS trajectory offering coarse positional cues. Central to our innovation, the Image Encoding Module transforms perspective views into geometrically consistent Bird's-Eye-View (BEV) features, while the Map Encoding Module extracts topological representations from SD maps. These complementary features undergo multi-modal fusion to generate conditioning features for our novel diffusion module, the core component that fundamentally redefines visual localization as a conditional generation task. Through iterative reverse diffusion steps, this module progressively denoises the corrupted GPS input, transforming unreliable sensor measurements into precise 3-DoF pose estimates. This generative approach marks the first successful application of diffusion models to visual localization, establishing a new trajectory refinement paradigm.
  • Figure 3: The localization results of our method on the KITTI dataset. In these visualizations, the red trajectory represents the ground truth (GT) GPS trajectory from the dataset, while the blue trajectory is the noisy GPS trajectory we synthetically generate. Given the noisy blue trajectory and a single image as input, our method produces the refined green "Generated Location" trajectory.
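
The pipeline described in the captions above (synthetically corrupt a ground-truth trajectory, then refine it through iterative reverse diffusion) can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the linear noise schedule, the `(x, y, yaw)` pose layout in normalized units, and all function names are hypothetical. The learned, BEV- and map-conditioned denoiser is replaced by an oracle that is handed the clean trajectory, purely so the loop can run without a trained network.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Corrupt a clean trajectory x0 to step t using the closed-form q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps


def make_oracle_eps_model(alpha_bars):
    """Stand-in for the learned denoiser.

    ASSUMPTION: a real model would predict the noise from BEV and SD-map
    conditioning features. Here `cond` is the clean trajectory itself, so the
    noise can be computed exactly. This is for illustration only.
    """
    def eps_model(x, t, cond):
        return (x - np.sqrt(alpha_bars[t]) * cond) / np.sqrt(1.0 - alpha_bars[t])
    return eps_model


def reverse_denoise(x_start, t_start, cond, eps_model, schedule, rng):
    """Ancestral DDPM sampling from step t_start down to step 0."""
    betas, alphas, alpha_bars = schedule
    x = x_start.copy()
    for t in range(t_start, -1, -1):
        eps_hat = eps_model(x, t, cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is injected on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    schedule = make_schedule()
    # Toy ground-truth trajectory: 8 poses (x, y, yaw) along a straight line.
    gt = np.stack([np.linspace(0.0, 1.0, 8), np.zeros(8), np.zeros(8)], axis=1)
    t_start = 20  # assumed noise level assigned to the raw GPS input
    noisy_gps, _ = forward_noise(gt, t_start, schedule[2], rng)
    refined = reverse_denoise(
        noisy_gps, t_start, gt, make_oracle_eps_model(schedule[2]), schedule, rng
    )
    print("max error before:", np.abs(noisy_gps - gt).max())
    print("max error after: ", np.abs(refined - gt).max())
```

Because the oracle knows the clean trajectory, the final reverse step recovers it up to floating-point error. That exactness is an artifact of the stand-in. With a learned noise predictor, the refined trajectory would only be as accurate as the network's conditioning on the image and map allows.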