Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization

Lukas Haas, Silas Alberti, Michal Skreta

TL;DR

Open-domain image geolocalization remains challenging because of distribution shifts and the need to reason with world knowledge. The authors propose StreetCLIP, which grounds CLIP in geographic context through domain-specific pretraining on synthetic captions, so that the model synthesizes a generalized zero-shot learner for every training batch. Pretrained on planet-scale Street View data, StreetCLIP achieves state-of-the-art zero-shot performance on IM2GPS and IM2GPS3K, outperforming supervised models trained on millions of images. The method generalizes to other domains, and the model is publicly released as a robust zero-shot backbone for geolocalization and related tasks.

Abstract

Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present StreetCLIP (https://huggingface.co/geolocal/StreetCLIP), a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.

Paper Structure (29 sections, 5 equations, 2 figures, 2 tables)

This paper contains 29 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: StreetCLIP's Synthetic Caption Pretraining. We formulate the task of image geolocalization in natural language via synthetic captions at various levels of geographic granularity. For every batch, our model synthesizes a generalized zero-shot learner, thus learning how to zero-shot learn within a specific domain. The figure layout draws on Radford et al. (2021).
  • Figure 2: Hierarchical Linear Probing Strategy. During inference, StreetCLIP synthesizes both a country-level and a city-level generalized zero-shot learner using two different caption templates. Given an input image, our method first identifies the country it deems to be the most likely image origin and then refines its guess within that country's 30 most populous cities.
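
The two-stage inference described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption templates, the function names, and the `score` callable (a stand-in for a CLIP-style image-text similarity) are all assumptions, and the city lists are assumed to be pre-sorted by population in descending order.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical caption templates; the exact wording used by StreetCLIP
# is not given in this summary.
COUNTRY_TEMPLATE = "A Street View photo in {name}."
CITY_TEMPLATE = "A Street View photo from {name}."

# Stand-in for an image-text similarity function (assumption).
Score = Callable[[Any, str], float]


def pick_best(image: Any, names: List[str], template: str, score: Score) -> str:
    """Return the name whose templated caption scores highest for the image."""
    if not names:
        raise ValueError("need at least one candidate name")
    captions = [template.format(name=n) for n in names]
    scores = [score(image, caption) for caption in captions]
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best]


def hierarchical_geolocate(
    image: Any,
    cities_by_country: Dict[str, List[str]],
    score: Score,
    top_k: int = 30,
) -> Tuple[str, str]:
    """Pick the most likely country, then refine among its top_k cities."""
    # Stage 1: country-level zero-shot classification.
    country = pick_best(image, list(cities_by_country), COUNTRY_TEMPLATE, score)
    # Stage 2: city-level classification restricted to that country's
    # most populous cities (lists assumed sorted by population).
    candidates = cities_by_country[country][:top_k]
    city = pick_best(image, candidates, CITY_TEMPLATE, score)
    return country, city
```

In practice `score` would wrap the model's image and text encoders and return their similarity. Restricting the second stage to one country keeps the city-level candidate set small, at the cost of being unable to recover from a wrong country guess.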