GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Angel Daruna, Nicholas Meegan, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

TL;DR

GeoSURGE addresses global image geo-localization by learning a hierarchical geographic embedding and enriching visual representations through semantic fusion. It models geography as a partitioned hierarchy of geocells, each represented by a learnable embedding, and trains via contrastive learning to align query visuals with geographic features. A semantic fusion module using latent cross-attention combines RGB appearance with semantic segmentation to produce a robust visual representation. Empirically, GeoSURGE achieves state-of-the-art results on 22 of 25 metrics across five benchmarks, underscoring the value of hierarchical geographic representations and semantic augmentation for precise geo-localization.
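The contrastive alignment described above can be illustrated with a small sketch. The summary does not state the exact loss, so the InfoNCE-style formulation, the temperature value, and the function names below are all assumptions made for illustration, not the authors' implementation. The idea shown is that the embedding of the geocell containing the image acts as the positive, and every other geocell embedding at that level acts as a negative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(image_emb, geocell_embs, target_idx, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    image_emb:    visual embedding of the query image
    geocell_embs: one learnable embedding per geocell at a single hierarchy level
    target_idx:   index of the geocell that contains the image's true location
    """
    q = normalize(image_emb)
    logits = [dot(q, normalize(g)) / temperature for g in geocell_embs]
    # Numerically stable log-sum-exp over all geocells.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the true geocell.
    return log_z - logits[target_idx]
```

Under these assumptions, the loss is small when the image embedding points toward its true geocell embedding and large when it points toward a different one. With a hierarchy, one such term could be computed per level, though how the levels are combined is not specified in this summary.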

Abstract

Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments set new best results on 22 of 25 metrics measured across five benchmark datasets, compared with prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
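To make the fusion step more concrete, here is a minimal sketch of one plausible reading of latent cross-attention: a small, fixed set of latent vectors queries the appearance tokens and then the segmentation tokens. Everything specific here is an assumption for illustration: the single attention head, the absence of learned projections, the residual update, the order of the two attention steps, and the names `cross_attend` and `fuse`. The paper's actual fusion blocks may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, tokens):
    """Single-head cross-attention without learned projections (simplification).

    Each latent is a query; the tokens serve as both keys and values.
    The attended value is added back to the latent as a residual.
    """
    dim = len(latents[0])
    updated = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        attended = [
            sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)
        ]
        updated.append([qj + aj for qj, aj in zip(q, attended)])
    return updated


def fuse(latents, rgb_tokens, seg_tokens):
    """Let the latents read appearance features, then segmentation features."""
    latents = cross_attend(latents, rgb_tokens)
    latents = cross_attend(latents, seg_tokens)
    return latents
```

One property this sketch does illustrate is why a latent bottleneck can be efficient: the output always has as many vectors as there are latents, regardless of how many RGB or segmentation tokens are supplied, and attention cost grows linearly in the token count rather than quadratically.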

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: GeoSURGE Approach Overview: The location of an input image is predicted via hierarchical inference, by matching the visual representation of the image against the geographic representation, which is learned beforehand. The visual representation is generated from the semantic fusion module, which enriches appearance features with semantic segmentation.
  • Figure 2: Diagram of GeoSURGE's semantic fusion blocks.
  • Figure 3: Sample successful GeoSURGE predictions. Best viewed when zoomed.
  • Figure 4: Sample unsuccessful GeoSURGE predictions. Best viewed when zoomed.
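
The hierarchical inference mentioned in the Figure 1 caption can be sketched as a greedy coarse-to-fine search: score the image embedding against the geocells at the top level, pick the best match, then repeat among that cell's children. This is only an assumed strategy for illustration; the paper may combine scores across levels differently. The dictionary layout, the toy `WORLD` hierarchy, and the function name `hierarchical_predict` are hypothetical.

```python
def dot(a, b):
    """Dot product used as the image-to-geocell similarity score."""
    return sum(x * y for x, y in zip(a, b))


def hierarchical_predict(image_emb, root_cells):
    """Greedy coarse-to-fine search over a geocell hierarchy (assumed strategy).

    Each cell is a dict with a name, an embedding, a (lat, lon) center,
    and a list of child cells. Returns the name and center of the
    finest-level cell reached.
    """
    if not root_cells:
        raise ValueError("the hierarchy must contain at least one geocell")
    cells = root_cells
    best = None
    while cells:
        best = max(cells, key=lambda c: dot(image_emb, c["embedding"]))
        cells = best["children"]
    return best["name"], best["center"]


# Toy two-level hierarchy with made-up embeddings and coordinates.
WORLD = [
    {
        "name": "north",
        "embedding": [1.0, 0.0],
        "center": (45.0, 0.0),
        "children": [
            {"name": "north-west", "embedding": [1.0, 0.5],
             "center": (45.0, -90.0), "children": []},
            {"name": "north-east", "embedding": [1.0, -0.5],
             "center": (45.0, 90.0), "children": []},
        ],
    },
    {"name": "south", "embedding": [-1.0, 0.0],
     "center": (-45.0, 0.0), "children": []},
]
```

A greedy descent like this only scores the children of the selected cell at each level, which keeps inference cheap, but an early mistake at a coarse level cannot be corrected at a finer one.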