
Adversarial Exploitation of Data Diversity Improves Visual Localization

Sihang Li, Siqi Tan, Bowen Chang, Jing Zhang, Chen Feng, Yiming Li

TL;DR

This work tackles the generalization gap in absolute pose regression by introducing RAP, a two-branch training framework that leverages appearance-diverse data synthesized via 3D Gaussian Splats and adversarial feature alignment to bridge synthetic-real gaps. A Transformer-based pose regressor ingests appearance-varying features, while a second branch continually augments training with perturbed poses and appearances, yielding strong performance gains across Cambridge, MARS, Aachen, and 7-Scenes, including robust operation under dramatic appearance changes and dynamic content. Extensive ablations reveal the critical roles of appearance diversity, data synthesis quality, and adversarial alignment in achieving generalization beyond memorization. The work also demonstrates practical benefits, including high inference throughput and effective post-refinement (RAPref), while outlining limitations and avenues for future integration of geometric priors and temporal information.

Abstract

Visual localization, which estimates a camera's pose within a known scene, is a fundamental capability for autonomous systems. While absolute pose regression (APR) methods have shown promise for efficient inference, they often struggle with generalization. Recent approaches attempt to address this through data augmentation with varied viewpoints, yet they overlook a critical factor: appearance diversity. In this work, we identify appearance variation as the key to robust localization. Specifically, we first lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring ability, enabling the synthesis of diverse training data that varies not just in poses but also in environmental conditions such as lighting and weather. To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 41% on indoor datasets, and 38% and 44% on outdoor datasets. Most notably, our method shows remarkable robustness in dynamic driving scenarios under varying weather conditions and in day-to-night scenarios, where previous APR methods fail. Project Page: https://ai4ce.github.io/RAP/
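The abstract reports results as relative reductions in translation and rotation error. As a minimal sketch of how such errors are commonly computed for pose regression (an assumption: the exact evaluation protocol is not stated in this section), the snippet below measures translation error as Euclidean distance and rotation error as the angle between two unit quaternions. All function names are hypothetical and chosen for illustration only.

```python
import math


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth camera positions."""
    return math.dist(t_pred, t_gt)


def rotation_error_deg(q_pred, q_gt):
    """Angle in degrees between two unit quaternions given as (w, x, y, z).

    Assumes both inputs are already normalized. The absolute value handles
    the fact that q and -q encode the same rotation.
    """
    dot = abs(sum(a * b for a, b in zip(q_pred, q_gt)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of an error relative to a baseline (0.5 means 50%)."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
    half = math.radians(90.0) / 2.0
    print(rotation_error_deg((1.0, 0.0, 0.0, 0.0),
                             (math.cos(half), 0.0, 0.0, math.sin(half))))
    print(relative_reduction(2.0, 1.0))
```

Under this convention, a reported "50% reduction" corresponds to `relative_reduction` returning 0.5 when comparing an aggregate error (often a median over test frames) against a baseline.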

Paper Structure

This paper contains 48 sections, 12 equations, 22 figures, 13 tables.

Figures (22)

  • Figure 1: We propose RAP, a novel pipeline to train robust APR models. We lift real-world 2D images into 3D Gaussian Splats [kerbl20233d] to synthesize images with diverse appearances and poses, improving model generalizability. We also introduce an adversarial discriminator, mitigating the syn-to-real gap to learn robust features. Together, we achieve state-of-the-art performance.
  • Figure 2: Pipeline of RAP. We lift multiple RGB video sequences into 3D Gaussian Splats, which serve as our data engine. Branch-1 (see the branch-1 section) takes paired real and synthetic images as input to regress poses, with a discriminator to bridge the syn-to-real gap. Branch-2 (see the branch-2 section) generates views with novel poses and appearances, which are fed into the same pose regressor as additional supervision.
  • Figure 3: Qualitative comparison of camera pose estimation errors between a) DFNet [chen2022dfnet] and b) our RAP framework across five scenes on the Cambridge Landmarks dataset [kendall2015posenet]. Our RAP framework estimates trajectories that more closely follow the ground truth, with significantly reduced rotation and position errors compared to DFNet [chen2022dfnet].
  • Figure 4: Visualization of RAPref on MARS [li2024multiagent]. In each sub-figure, a diagonal line separates the "Predicted" (rendered from the refined pose) and "GT" (ground truth) sections. Smooth alignment along this boundary shows RAPref's improved pose accuracy.
  • Figure 5: Visualization of the localization errors of RAPref on the 7-Scenes dataset [shotton2013scene].
  • ...and 17 more figures
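
The Figure 2 caption describes branch-2 as generating views with novel poses. As a rough sketch of what sampling a novel pose around a training pose could look like, the snippet below applies a small random translation and a small random rotation to a pose stored as a position plus unit quaternion. This is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform sampling scheme, and the default magnitudes (`max_shift`, `max_angle_deg`) are all hypothetical and are not taken from the paper.

```python
import math
import random


def quat_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def perturb_pose(t, q, max_shift=0.2, max_angle_deg=10.0, rng=random):
    """Return a perturbed (position, quaternion) pair near the input pose.

    t: (x, y, z) position; q: unit quaternion (w, x, y, z).
    max_shift and max_angle_deg are hypothetical bounds, not paper values.
    """
    # Shift each coordinate by a uniform offset in [-max_shift, max_shift].
    new_t = tuple(c + rng.uniform(-max_shift, max_shift) for c in t)

    # Draw a random rotation axis by normalizing a Gaussian sample.
    while True:
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        if norm > 1e-12:
            break
    axis = [a / norm for a in axis]

    # Rotate by a uniform angle within the bound, about that axis.
    half = math.radians(rng.uniform(-max_angle_deg, max_angle_deg)) / 2.0
    dq = (math.cos(half), *(math.sin(half) * a for a in axis))
    new_q = quat_multiply(dq, q)
    return new_t, new_q


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    print(perturb_pose((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0), rng=rng))
```

In a pipeline like the one in the caption, each perturbed pose would then be passed to the Gaussian Splat renderer to produce a synthetic image, with the perturbed pose serving as its label; that rendering step is outside the scope of this sketch.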