GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers

Manu S Pillai, Mamshad Nayeem Rizve, Mubarak Shah

TL;DR

GAReT tackles cross-view video geolocalization without relying on camera intrinsics or odometry by casting CVGL as a two-stage, transformer-based process. It introduces GeoAdapter, a transformer-adapter that aggregates image-level embeddings into a video-level representation, and TransRetriever, an encoder–decoder transformer that autoregressively selects temporally consistent GPS predictions from top-$k$ neighbors. The method localizes a street-view video to a large aerial region and then performs frame-level retrieval within that region, achieving state-of-the-art results on GAMa and SeqGeo while reducing computational cost relative to prior video-based CVGL methods. This approach offers a practical, scalable solution for real-world CVGL tasks with improved temporal coherence and without requiring costly camera/odometry data.

Abstract

Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame's predictions. Our method's effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets. Our code is available at https://github.com/manupillai308/GAReT.
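The contrast between independent per-frame retrieval and temporally consistent selection can be illustrated with a small sketch. This is not the authors' TransRetriever, which is a learned encoder-decoder transformer. As a stand-in, the sketch uses a hand-written dynamic program that picks one candidate per frame so that the total jump distance along the trajectory is minimal. The function names (`top_k_neighbors`, `consistent_path`), the cosine-similarity ranking, and the toy coordinates are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_neighbors(query, gallery, k):
    """Indices of the k gallery embeddings most similar to the query."""
    scored = [(cosine(query, g), i) for i, g in enumerate(gallery)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]

def consistent_path(candidates, coords):
    """Choose one candidate per frame, minimizing total jump distance.

    candidates[t] lists the gallery indices retrieved for frame t;
    coords[i] is the (x, y) location of gallery item i.
    Hypothetical stand-in for a learned temporal model.
    """
    cost = [0.0] * len(candidates[0])  # best cost of a path ending at each candidate
    back = []                          # back[t-1][j]: best predecessor of candidate j at frame t
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for c in candidates[t]:
            options = [cost[j] + math.dist(coords[p], coords[c])
                       for j, p in enumerate(candidates[t - 1])]
            best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[best])
            ptr.append(best)
        cost = new_cost
        back.append(ptr)
    # Trace the cheapest path backwards from the last frame.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [candidates[-1][j]]
    for t in range(len(back) - 1, -1, -1):
        j = back[t][j]
        path.append(candidates[t][j])
    return path[::-1]

# Toy example: gallery item 2 lies far away from items 0, 1 and 3.
coords = [(0, 0), (1, 0), (10, 10), (2, 0)]
candidates = [[0], [2, 1], [3]]            # per-frame neighbors, best match first
nearest_only = [c[0] for c in candidates]  # independent top-1 choice per frame
smoothed = consistent_path(candidates, coords)
```

In this toy case, the independent top-1 choice passes through the distant item 2 at the middle frame, while the smoothed path selects the second-ranked neighbor and avoids the jump. A learned model such as the one described in the abstract would base this choice on feature embeddings rather than on a fixed distance penalty.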

Paper Structure (20 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of our proposed approach GAReT. (A) We begin by optimizing our image transformer encoders $\mathbf{T_{a,s}}$ (B) with pairs of a street-view frame $V^s_k$ and its matching small aerial image $I_{s_k}^a$. (C) Then, to adapt our image encoder to video inputs, we add our GeoAdapter module $\mathbf{G_A}$ and optimize only the adapter parameters with video pairs as inputs, i.e., a street-view video $V^s$ and the corresponding large aerial image $I^a_L$. For training, we sample every $k^\text{th}$ frame from the street-view video and partition the large aerial image into non-overlapping patches. (D) In $\mathbf{G_A}$, we apply temporal self-attention (TSA) only to the CLS tokens, reusing the spatial self-attention weights for the TSA computation. (E) During inference, we first perform a Sequence-to-Image inference procedure: given a query street-view video, our unified module $\mathbf{U = \{T, G_A\}}$ produces feature embeddings for both $V^s$ and $I^a_L$. Using these embeddings, we retrieve the $t$ nearest-neighbor large aerial images (here we show $t=1$) and construct a small aerial image gallery $\mathcal{G}$. (F) Finally, $\mathbf{G_A}$ is removed, and feature embeddings for $I_{s_k}^a$ and $V^s_k$ are obtained. These features are then passed to our TransRetriever model $\mathbf{T_{AR}}$ to obtain the final frame-by-frame GPS predictions, which form a GPS trajectory.
  • Figure 2: Examples of trajectories obtained using NN-based retrieval (A) and our proposed TransRetriever (B). NN-based retrieval suffers heavily from temporally inconsistent predictions, visible as jumps in the trajectory, whereas TransRetriever predictions are globally consistent and preserve temporal coherence.
  • Figure 3: Comparison of our proposed GeoAdapter module with different variants of architectural design. With a top-1 recall rate of 50.7 (50.69), our proposed architecture best suits CVGL.
  • Figure 4: Top-$k$ retrieval recall score of frame-by-frame inference using our method when multiple large aerial images are taken to create the gallery.
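The top-$k$ recall reported in the figure captions above can be sketched as follows, assuming the common retrieval definition: a query counts as a hit when its ground-truth gallery item appears among its first $k$ ranked results. The function name `recall_at_k` and the toy rankings are hypothetical and not taken from the paper's code.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose true match is within the top-k results.

    ranked_lists[q] is the gallery ranking for query q (best match first);
    ground_truth[q] is the index of the correct gallery item for query q.
    """
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth)
               if gt in ranked[:k])
    return hits / len(ground_truth)

# Toy example with three queries over a gallery of four items.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
truth = [1, 0, 1]
scores = {k: recall_at_k(rankings, truth, k) for k in (1, 2, 3)}
```

Here only the second query is correct at rank 1, so recall grows from 1/3 at $k=1$ to 2/3 at $k=2$ and reaches 1 at $k=3$.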