CVGL: Causal Learning and Geometric Topology

Songsong Ouyang, Yingying Zhu

Abstract

Cross-view geo-localization (CVGL) aims to estimate the geographic location of a street image by matching it with a corresponding aerial image. This is critical for autonomous navigation and mapping in complex real-world scenarios. However, the task remains challenging due to significant viewpoint differences and the influence of confounding factors. To tackle these issues, we propose the Causal Learning and Geometric Topology (CLGT) framework, which integrates two key components: a Causal Feature Extractor (CFE) that mitigates the influence of confounding factors by leveraging causal intervention to encourage the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird's Eye View (BEV) road topology into street features to alleviate cross-view inconsistencies caused by extreme perspective changes. Additionally, we introduce a Data-Adaptive Pooling (DA Pooling) module to enhance the representation of semantically rich regions. Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions. Our code is available at https://github.com/oyss-szu/CLGT.

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Structural Causal Model (SCM) for cross-view geo-localization. Nodes represent variables and arrows denote dependencies.
  • Figure 2: Visualization of the low-, mid-, and high-frequency components of a street image. Low and high frequencies emphasize domain-specific information such as style, while the mid-frequency band retains domain-invariant cues such as structure and shape.
  • Figure 3: Overview of the proposed Causal Learning and Geometric Topology (CLGT) framework. The road topology information from BEV is fused via the GT Fusion module to obtain the fused features, which are then used for location matching with aerial image features. The causally enhanced street features from the CFE module provide causal supervision, and the DA Pooling module performs final feature extraction.
  • Figure 4: Illustration of the Causal Feature Extractor (CFE). The input image is transformed into the frequency domain via the Discrete Cosine Transform (DCT). A Content-aware Mask (CaM) strategy dynamically separates mid-frequency causal components from low- and high-frequency non-causal components. A Gaussian function is applied only to the non-causal components (e.g., lighting and brightness) to reduce their influence. Both causal and non-causal parts are then reconstructed via the inverse DCT (IDCT) to obtain the causally enhanced image.
  • Figure 5: Overview of the GT Fusion and DA Pooling modules. GT Fusion uses cross-attention to exchange semantic information between street and BEV features, then uses Dual Dynamic Fusion (DDF) to enhance fusion robustness. DA Pooling employs a gating mechanism to adaptively weight features, highlighting the most informative ones.
  • ...and 1 more figure
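
The frequency-domain pipeline described in the Figure 4 caption (DCT, band separation, Gaussian treatment of the non-causal bands, IDCT) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Several parts are assumptions made only for illustration: the fixed radial band mask stands in for the dynamic Content-aware Mask, the "Gaussian function" is read here as multiplicative Gaussian noise, and the names dct_matrix, band_mask, causal_enhance as well as the parameters low, high, and sigma are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is the inverse."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale
    return mat


def band_mask(h: int, w: int, low: float, high: float) -> np.ndarray:
    """Boolean mask of mid-frequency coefficients (assumed fixed band, not CaM)."""
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u**2 + v**2) / np.sqrt(2.0)  # normalized to [0, 1)
    return (radius >= low) & (radius < high)


def causal_enhance(image, low=0.1, high=0.6, sigma=0.1, rng=None):
    """Perturb only low/high-frequency DCT coefficients of a 2-D grayscale image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    dh, dw = dct_matrix(h), dct_matrix(w)

    coeffs = dh @ image @ dw.T           # forward 2-D DCT
    mid = band_mask(h, w, low, high)     # assumed "causal" band

    # Assumption: Gaussian treatment = multiplicative noise with mean 1.
    noise = rng.normal(1.0, sigma, size=coeffs.shape)
    perturbed = np.where(mid, coeffs, coeffs * noise)

    return dh.T @ perturbed @ dw         # inverse 2-D DCT


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    street = rng.random((8, 8))          # toy stand-in for a street image
    enhanced = causal_enhance(street, sigma=0.3, rng=rng)
    print(enhanced.shape)
```

In this sketch the mid-frequency coefficients pass through unchanged, so the structure and shape cues that Figure 2 associates with that band are preserved, while the low- and high-frequency coefficients are randomized. The band limits and noise scale are placeholders, not values from the paper.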