Spatially-Weighted CLIP for Street-View Geo-localization

Ting Han, Fengjiao Li, Chunsong Chen, Haoling Huang, Yiping Chen, Meiliu Wu

Abstract

This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler's First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
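The abstract's core idea, replacing one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance, can be sketched as follows. The paper's exact kernel and loss are not shown in this excerpt, so the haversine distance, exponential decay kernel, and `bandwidth_km` parameter below are illustrative assumptions, not the authors' specification:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (geodesic) distance in km; inputs in degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def spatial_soft_labels(coords, bandwidth_km=1.0):
    """Distance-aware soft targets for a mini-batch of (lat, lon) rows.

    Per Tobler's First Law, nearby locations receive higher target
    similarity; each row sums to 1, replacing the one-hot InfoNCE target.
    The exponential kernel and bandwidth are assumptions for illustration.
    """
    lat, lon = coords[:, 0], coords[:, 1]
    d = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    logits = -d / bandwidth_km          # similarity decays with geodesic distance
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def soft_infonce_loss(sim, targets):
    """Cross-entropy between softmax over similarities and spatial soft labels."""
    x = sim - sim.max(axis=1, keepdims=True)
    log_p = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    return -(targets * log_p).sum(axis=1).mean()
```

With this weighting, a geographically nearby batch element is no longer treated as a hard negative: its target probability stays non-zero, which is the false-negative mitigation the abstract describes.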

Paper Structure

This paper contains 13 sections, 6 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Motivation and overview of SW-CLIP. Standard CLIP training treats all non-matching samples in a mini-batch as equally negative, which can incorrectly penalize geographically nearby observations that share similar scene context. Guided by Tobler’s First Law of Geography and spatial autocorrelation, SW-CLIP replaces the hard one-hot supervision with a distance-aware spatial soft label: nearby locations receive higher similarity targets while distant locations are down-weighted. This geographic weighting reduces false-negative conflicts in contrastive learning and encourages embeddings to be both retrieval-friendly and spatially coherent for street-view geo-localization.
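The abstract also mentions a neighborhood-consistency regularization that preserves local spatial structure in the embedding space. Its exact form is not given in this excerpt; one plausible reading, sketched below under that assumption, penalizes embedding dissimilarity among each sample's geographic k-nearest neighbors (the neighbor count `k` and the cosine-based penalty are hypothetical choices):

```python
import numpy as np

def neighborhood_consistency_loss(embeddings, coords, k=2):
    """Penalize embedding dissimilarity among geographic k-nearest neighbors.

    Hypothetical sketch: for each sample, find its k nearest neighbors in
    coordinate space and penalize (1 - cosine similarity) of their
    embeddings, so that spatially adjacent observations stay close in the
    embedding space.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest geographic neighbors
    cos = (z[:, None, :] * z[nn]).sum(axis=-1)  # (N, k) cosine similarities
    return (1.0 - cos).mean()
```

In a full training loop this term would be added, with some weight, to the soft-label contrastive loss; the trade-off between retrieval accuracy and spatial coherence then depends on that weight.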