Temporal Attention for Cross-View Sequential Image Localization

Dong Yuan, Frederic Maire, Feras Dayoub

Abstract

This paper introduces a novel approach to cross-view localization that performs fine-grained, sequential localization of street-view images within a single known satellite image patch, a significant departure from traditional one-to-one image retrieval methods. Our model is equipped with a novel Temporal Attention Module (TAM) that leverages contextual information to significantly improve sequential image localization accuracy. On the Cross-View Image Sequence (CVIS) dataset, our method substantially reduces both mean and median localization errors, outperforming current state-of-the-art single-image localization techniques. Additionally, by adapting the KITTI-CVL dataset into sequential image sets, we offer a more realistic dataset for future research and demonstrate our model's robust generalization across varying times and areas, evidenced by a 75.3% reduction in mean distance error in cross-view sequential image localization.

Paper Structure (24 sections, 6 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparison of two cross-view localization tasks. Top: Cross-view fine-grained localization predicts discrete locations for street-view images that are widely spaced and discontinuously distributed along the route or area of interest. Bottom: Cross-view sequential image localization predicts the location in a satellite image of each street-view image in a sequence, which aligns more closely with practical localization applications.
  • Figure 2: Overview of the proposed architecture. The feature extractors extract feature maps independently from the two views. These feature maps are then fed into self-attention blocks (SAB) and cross-attention blocks (CAB) to generate fused features. The fused features are then combined with the hidden state $h^{t-1}$ from the previous time step within the temporal attention module. Finally, the resulting features are used to predict locations.
  • Figure 3: Illustration of the Temporal Attention Module (TAM). The fused feature $F^{t}_f$ is projected to generate the Query $Q_f$, while the hidden state $h^{t-1}$ from the previous time step is projected to form the Key $K_h$ and Value $V_h$. Position encoding is applied to both $Q_f$ and $K_h$ before they are passed into the multi-head attention layers.
  • Figure 4: A sequence sample segmented from the KITTI-CVL dataset. Each red marker denotes the location of a street-view image. The red arrow indicates the direction of travel of the vehicle.
  • Figure 5: Qualitative localization results on the CVIS dataset (zhang2023cross). Blue, purple, and red points denote consecutive street-view locations from the ground truth, the baseline method, and our proposed method, respectively. The baseline method tends to cluster multiple predicted locations in the same area, whereas our predictions align more closely with the ground-truth locations.
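
The data flow described in the Figure 3 caption can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the token and feature sizes, the random weights, the sinusoidal form of the position encoding, and the choice to add it after the projections are all hypothetical. Only the roles are taken from the caption: the query comes from the fused feature $F^{t}_f$, the key and value come from the hidden state $h^{t-1}$, and position encoding is applied to the query and key but not the value. The output projection and the update that produces the next hidden state are not described in the captions and are omitted.

```python
import math
import random

def linear(x, w):
    """Multiply a token matrix x (n x d_in) by a weight matrix w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def sinusoidal_pe(n, d):
    """Sinusoidal position encoding (n x d); the form is an assumption."""
    return [[math.sin(p / 10000 ** (i / d)) if i % 2 == 0
             else math.cos(p / 10000 ** ((i - 1) / d))
             for i in range(d)] for p in range(n)]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def temporal_attention(fused, hidden, wq, wk, wv, num_heads):
    """Multi-head cross-attention: Q from fused features, K and V from h^{t-1}."""
    d = len(wq[0])
    dh = d // num_heads  # per-head width; d is assumed divisible by num_heads
    q = add(linear(fused, wq), sinusoidal_pe(len(fused), d))    # Q_f with position encoding
    k = add(linear(hidden, wk), sinusoidal_pe(len(hidden), d))  # K_h with position encoding
    v = linear(hidden, wv)                                      # V_h, no position encoding
    out = [[] for _ in fused]
    for h in range(num_heads):
        lo, hi = h * dh, (h + 1) * dh
        for i, qi in enumerate(q):
            # Scaled dot-product scores of one query token against every key token.
            scores = [sum(a * b for a, b in zip(qi[lo:hi], kj[lo:hi])) / math.sqrt(dh)
                      for kj in k]
            weights = softmax(scores)
            # Weighted sum of this head's value slice, concatenated across heads.
            out[i].extend(sum(wj * vj[c] for wj, vj in zip(weights, v))
                          for c in range(lo, hi))
    return out

# Toy usage with hypothetical sizes: 4 fused tokens, 6 hidden-state tokens, width 8.
random.seed(0)
d = 8
rand = lambda n, m: [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
fused_t, hidden_prev = rand(4, d), rand(6, d)
out = temporal_attention(fused_t, hidden_prev, rand(d, d), rand(d, d), rand(d, d), num_heads=2)
print(len(out), len(out[0]))  # 4 8
```

Each output token is a convex combination of the projected hidden-state tokens, so the current frame reads context from the previous time step while the number of output tokens follows the fused feature. In practice a tensor library's multi-head attention layer would replace the explicit loops.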