VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung

Abstract

Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
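
The polar transformation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `polar_transform`, the nearest-neighbour sampling, the "far rows at the top" layout, and the clockwise-from-north azimuth convention are all assumptions made for illustration, and the sketch resamples a raw image array rather than learned features. It only shows the general idea that, after resampling around a candidate location, each output column corresponds to one viewing direction, so a change in camera heading becomes a circular shift along the horizontal axis.

```python
import numpy as np

def polar_transform(sat, center, out_h, out_w, max_radius):
    """Resample a satellite map of shape (H, W, C) onto a polar grid.

    Hypothetical helper (not from the paper). Columns index azimuth in
    [0, 2*pi), measured clockwise from image-up; rows index radius, with
    the farthest ring in row 0 and the centre pixel in the last row.
    """
    H, W = sat.shape[:2]
    cy, cx = center

    # One azimuth per output column, one radius per output row.
    theta = 2.0 * np.pi * np.arange(out_w) / out_w
    radius = max_radius * (out_h - 1 - np.arange(out_h)) / (out_h - 1)
    rr, tt = np.meshgrid(radius, theta, indexing="ij")

    # Polar -> Cartesian pixel coordinates around the candidate location.
    ys = cy - rr * np.cos(tt)
    xs = cx + rr * np.sin(tt)

    # Nearest-neighbour lookup, clamped to the image bounds (assumption;
    # bilinear sampling would be the usual choice for feature maps).
    yi = np.clip(np.rint(ys).astype(int), 0, H - 1)
    xi = np.clip(np.rint(xs).astype(int), 0, W - 1)
    return sat[yi, xi]
```

Under these conventions, evaluating the function at several candidate centres yields one polar map per candidate position, and candidate orientations can be represented as column shifts of each map.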

Paper Structure (69 sections, 8 equations, 13 figures, 11 tables, 2 algorithms)

Figures (13)

  • Figure 1: Ground-view and satellite-view images exhibit a large viewpoint gap due to misalignment along both the horizontal and vertical axes on the image plane. Previous geometry-based methods, such as polar and projective transformations, partially mitigated this issue by ensuring horizontal correspondence but could not fully address vertical misalignment. For example, projective transformations often produce severe artifacts around vertical structures such as buildings. VIRD overcomes these limitations through dual-axis transformation, introducing a shared virtual vertical axis $a$ to establish consistent cross-view correspondence.
  • Figure 2: Overview of VIRD. VIRD constructs view-invariant descriptors through dual-axis transformation. The process begins with a polar transformation that, for each candidate pose $\mathbf{p}_c \in \mathcal{P}$ (the set of candidate poses), maps the satellite features $F_s$ into a representation that corresponds horizontally to the ground features $F_g$, generating the polar-transformed features $F_{s2p}$. The context-enhanced positional attention (CEPA) module then vertically transforms the cross-view features through a positional attention mechanism. The resulting satellite and ground features, $F_{s2p'}$ and $F_{g'}$, are compressed along the vertical axis to generate orientation-aware descriptors $D_{s2p}$ and $D_g$. During training, a view-reconstruction loss is computed at the ground-truth pose $\mathbf{p}^*$ to enforce view invariance by reconstructing both the original and cross views. At inference, the final 3-DoF pose is estimated by combining descriptor matching with residual regression.
  • Figure 3: Schematic illustration of the context-enhanced positional attention (CEPA) module consisting of positional attention (PA) and context enhancement (CE). PA generates attention weights $\mathcal{A}_g$ and $\mathcal{A}_{s2p}$ by comparing the shared virtual positional encodings $P_a$ with the positional encodings of the ground and satellite views, $P_g$ and $P_{s2p}$, respectively. CE then adaptively refines the ground attention weights $\mathcal{A}_g$ using contextual information from the ground feature $F_g$, resulting in $\mathcal{A}_{g'}$. These attention weights vertically transform the cross views by weighting the features $F_g$ and $F_{s2p}$ with $\mathcal{A}_{g'}$ and $\mathcal{A}_{s2p}$, respectively.
  • Figure 4: Visualization of attention weights, showing the activations corresponding to each shared virtual positional encoding (PE). Red indicates stronger activations. Orange and green dots indicate the real spatial correspondence between the views.
  • Figure 5: Visualization of CEPA's attention weights, view reconstruction, and pose estimation results from VIRD.
  • ...and 8 more figures
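
The vertical transformation and descriptor matching described in the Figure 2 and Figure 3 captions can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the method as implemented in the paper: the attention is taken to be a scaled dot product with a softmax over image rows, the vertical compression is taken to be a mean, the ground view is assumed to be a full 360° panorama so that orientation reduces to a circular column shift, and the context-enhancement (CE) step and residual regression are omitted. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def positional_attention(P_a, P_view):
    """Compare shared virtual encodings P_a (A, d) with a view's per-row
    positional encodings P_view (H, d). Returns weights of shape (A, H);
    each virtual row holds a distribution over the view's rows.
    (Scaled dot-product form is an assumption.)"""
    scores = P_a @ P_view.T / np.sqrt(P_a.shape[1])
    return softmax(scores, axis=1)

def vertical_transform(F, attn):
    """Re-express features F (H, W, C) on the shared vertical axis by
    weighting rows with attn (A, H). Returns (A, W, C)."""
    return np.einsum("ah,hwc->awc", attn, F)

def compress_vertical(F):
    """Collapse the vertical axis of F (A, W, C) into a descriptor (W, C).
    Mean pooling and L2 normalisation are assumptions."""
    D = F.mean(axis=0)
    return D / (np.linalg.norm(D) + 1e-8)

def best_column_shift(D_g, D_s):
    """Find the circular shift of D_s along the azimuth axis that best
    correlates with D_g. Returns (shift, scores)."""
    scores = [float(np.sum(D_g * np.roll(D_s, k, axis=0)))
              for k in range(D_s.shape[0])]
    return int(np.argmax(scores)), scores
```

In this sketch both views would pass through `positional_attention` with the same `P_a`, so their transformed features share one vertical axis before compression. The returned shift index would map to a heading of `shift * 360 / W` degrees under the panorama assumption.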