RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings

Aayush Dhakal, Srikumar Sastry, Subash Khanal, Adeel Ahmad, Eric Xing, Nathan Jacobs

TL;DR

This work tackles the loss of image-specific visual information in geo-embeddings learned through contrastive location–image alignment. It introduces RANGE, a retrieval-augmented neural field that estimates high-resolution visual features by semantically and spatially aggregating from a compact database, enabling multi-resolution geo-embeddings via RANGE$^+$. Across tasks including biome and ecoregion classification, country labeling, and ERA5 climate prediction, RANGE and RANGE$^+$ outperform state-of-the-art baselines in most tasks, with gains of up to 13.1% on classification and 0.145 $R^2$ on regression, and remain robust to database size. The approach offers a practical, scalable way to fuse rich visual context into geographic representations, yielding improved priors for fine-grained tasks and multi-frequency embeddings while preserving efficiency. It could also be extended to broader location–image alignment problems.

Abstract

The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, we posit that current training strategies fail to fully capture important visual features. We provide an information-theoretic perspective on why the embeddings produced by these methods discard visual information that is crucial for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features of multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models by significant margins in most tasks. We show gains of up to 13.1% on classification tasks and 0.145 $R^2$ on regression tasks. All our code and models will be made available at: https://github.com/mvrl/RANGE.

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Adding explicit visual features allows us to generate higher-resolution location embeddings. We use a retrieval strategy that allows us to generate a multi-resolution retrieval-augmented neural field of geo-embeddings (RANGE).
  • Figure 2: Framework of RANGE. (a) In the training stage, a shared embedding space is learned between locations and images. (b) We create a database of low-resolution and high-resolution image embeddings using the trained projection layer and a powerful pretrained image model, respectively. (c) During inference, we use a location as the query, low-resolution image embeddings as keys, and high-resolution image embeddings as values. Using our retriever function, we compute the approximate high-resolution embeddings for the query. We concatenate ($\oplus$) the approximated visual feature with our query embedding.
  • Figure 3: Performance of our model with respect to the database size. The results show that, compared to RANGE-HAVER, both RANGE and RANGE$^+$ are very robust to changes in database size. We can maintain the same performance even when using only 10% of the samples in the database.
  • Figure 4: We visualize the geo-embeddings from different models by projecting them into three dimensions using Independent Component Analysis (ICA). The results suggest that, by explicitly adding visual features, our method captures more high-frequency information than the existing models.
  • Figure 5: Interpolating the $\beta$ parameter in RANGE$^+$ allows us to control the spatial smoothness of our embeddings. The results show that RANGE$^+$ can be used to generate neural fields of geo-embeddings at multiple frequencies.
  • ...and 1 more figure
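
The inference step described in the Figure 2 caption (location query, low-resolution keys, high-resolution values, concatenation) and the $\beta$ control from Figure 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `retrieve`, the softmax weighting, the `temperature` parameter, the great-circle spatial score, and the linear blend `beta * semantic + (1 - beta) * spatial` are all assumptions made for illustration, since this section does not state the retriever's exact form.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def _softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _unit_sphere(lat_deg, lon_deg):
    # Map latitude/longitude in degrees to a point on the unit sphere.
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return [math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat)]

def _weighted_sum(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def retrieve(query_emb, query_latlon, keys, values, db_latlons,
             beta=1.0, temperature=0.1):
    """Hypothetical retriever: approximate a location's high-resolution
    visual feature, then concatenate it with the query embedding.

    query_emb    -- location embedding from the shared space (query)
    keys         -- low-resolution image embeddings in the database
    values       -- high-resolution image embeddings in the database
    db_latlons   -- (lat, lon) of each database entry
    beta         -- ASSUMED blend: 1.0 = purely semantic, 0.0 = purely spatial
    """
    # Semantic weights: cosine similarity between the query and each key.
    q = _normalize(query_emb)
    semantic_scores = [_dot(q, _normalize(k)) for k in keys]
    semantic = _weighted_sum(_softmax(semantic_scores, temperature), values)

    # Spatial weights: cosine of the great-circle angle to each entry.
    p = _unit_sphere(*query_latlon)
    spatial_scores = [_dot(p, _unit_sphere(*ll)) for ll in db_latlons]
    spatial = _weighted_sum(_softmax(spatial_scores, temperature), values)

    # Assumed linear blend of the two estimates.
    approx_visual = [beta * s + (1 - beta) * g
                     for s, g in zip(semantic, spatial)]

    # Concatenation (the "⊕" in the caption).
    return list(query_emb) + approx_visual
```

Under these assumptions, lowering `beta` shifts weight toward spatially nearby database entries, which is one plausible way the smoothness control in Figure 5 could behave; the output dimension is the query dimension plus the value dimension.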