STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision
Hin Wai Lui, Jeffrey L. Krichmar
TL;DR
This work tackles meter-level vision-based localization in dynamic, large-scale environments by reframing localization as a generative regression task. It introduces Spatial Temporal Reasoning Models (STRMs), specifically VAE-RNN and VAE-Transformer, that map sequences of first-person views to global map perspectives and coordinates without relying on dense satellite image databases. The Transformer-based STRM achieves state-of-the-art localization performance (AUC up to $0.777$) with a lightweight model (~$77$ MB) and real-time inference (~$10.9$ FPS), and in some real-world scenarios it rivals smartphone GPS accuracy. The results support a cognitively inspired, regionally specialized deployment strategy for location-specific autonomous driving and point toward future work on temporal robustness and long-term environmental changes.
Abstract
This paper explores vision-based localization through a biologically inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models in two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves median deviations of 2.29 m (1.37% of environment size) on the campus and 4.45 m (0.35% of environment size) in the downtown area, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our Localization Performance Characteristics (LPC) analysis shows the VAE-Transformer achieving an AUC of 0.777, compared with 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state of the art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing cross-view geo-localization methods. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.
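To make the reported metrics concrete, the following is a minimal sketch of how a median deviation, its share of the environment size, and an area-under-curve score could be computed from predicted and ground-truth coordinates. The function names are hypothetical, and the AUC shown here (fraction of predictions within a distance threshold, integrated over thresholds and normalized to [0, 1]) is only one plausible reading of an LPC-style curve; the exact LPC definition is not given in this section, so this should not be taken as the authors' procedure.

```python
import numpy as np


def median_deviation(pred_xy, true_xy):
    """Median Euclidean distance (meters) between predicted and true positions."""
    diffs = np.asarray(pred_xy, dtype=float) - np.asarray(true_xy, dtype=float)
    errors = np.linalg.norm(diffs, axis=1)
    return float(np.median(errors))


def relative_deviation(median_m, env_size_m):
    """Median deviation expressed as a percentage of the environment size."""
    return 100.0 * median_m / env_size_m


def threshold_curve_auc(errors, max_threshold, num=101):
    """Normalized area under an accuracy-vs-threshold curve.

    ASSUMPTION: this is an illustrative stand-in, not the paper's LPC definition.
    For each threshold t in [0, max_threshold], it takes the fraction of
    errors <= t, then integrates with the trapezoid rule and divides by
    max_threshold so the result lies in [0, 1].
    """
    errors = np.asarray(errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, num)
    fraction_within = (errors[None, :] <= thresholds[:, None]).mean(axis=1)
    widths = np.diff(thresholds)
    area = np.sum((fraction_within[1:] + fraction_within[:-1]) / 2.0 * widths)
    return float(area / max_threshold)
```

As a consistency check on the abstract's figures, 2.29 m at 1.37% implies an environment size of roughly 167 m, and 4.45 m at 0.35% implies roughly 1.27 km, which matches a campus being much smaller than a downtown area.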
