Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization
Lukas Haas, Silas Alberti, Michal Skreta
TL;DR
Open-domain image geolocalization remains challenging because of distribution shifts and the need to reason with world knowledge. The authors propose StreetCLIP, which pretrains CLIP on synthetic, domain-specific captions to ground it in geographic context, in effect training batch-specific generalized zero-shot learners. Pretrained on planet-scale Street View data, StreetCLIP achieves state-of-the-art zero-shot performance on IM2GPS and IM2GPS3K, outperforming supervised models trained on more than 4 million images. The method generalizes to other domains, and the model is publicly released as a robust zero-shot backbone for geolocalization and related tasks.
Abstract
Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present StreetCLIP (https://huggingface.co/geolocal/StreetCLIP), a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
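To make the idea of synthetic captions and zero-shot prediction concrete, the following is a minimal sketch, not the authors' implementation. The caption template, the function names (`synthetic_caption`, `zero_shot_label`), and the toy two-dimensional embeddings are all assumptions chosen for illustration; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def synthetic_caption(country, region=None, city=None):
    """Build a caption from location metadata.

    The wording of this template is an illustrative assumption,
    not necessarily the template used to train StreetCLIP.
    """
    parts = []
    if city:
        parts.append(f"close to the town of {city}")
    if region:
        parts.append(f"in the region of {region}")
    parts.append(f"in {country}")
    return "A Street View photo " + " ".join(parts) + "."


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(image_emb, caption_embs):
    """Return the label whose caption embedding best matches the image.

    caption_embs maps each candidate label to the embedding of its caption.
    No classifier head is trained: the label set can change at inference time.
    """
    return max(caption_embs, key=lambda label: cosine(image_emb, caption_embs[label]))


# Toy usage with hand-written embeddings (hypothetical values).
caption = synthetic_caption("France", region="Brittany", city="Rennes")
prediction = zero_shot_label([1.0, 0.0], {"France": [0.9, 0.1], "Japan": [0.0, 1.0]})
```

Because prediction is a nearest-caption lookup rather than a fixed classification head, the same model can be queried with any set of candidate locations, which is what allows evaluation without finetuning on a fixed set of classes.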
