LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild
Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, Xingquan Zhu
TL;DR
This work tackles image geolocation in-the-wild using large multimodal language models (LMMs) by introducing a Street View–based dataset and a comprehensive benchmark that covers both training-free and training-based settings across closed-source and open-source LMMs. It combines CLIP-based dynamic retrieval, multiple prompting strategies, and fine-tuning to evaluate country-level localization performance. Key findings show that closed-source models like GPT-4V and Gemini achieve strong geolocation capability, that open-source models can approach these results with fine-tuning, and that incorporating reference images substantially boosts some models. The study highlights the potential and limitations of LMMs for geolocation and outlines avenues for finer-grained, city- or coordinate-level localization in future work.
Abstract
Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images from various countries via Google Street View. Then, we conduct training-free and training-based evaluations on closed-source and open-source multimodal language models. Our findings indicate that closed-source models demonstrate superior geolocation abilities, while open-source models can achieve comparable performance through fine-tuning.
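
To make the retrieval and scoring ideas in the summary concrete, the following is a minimal sketch, not the authors' implementation. It assumes image embeddings have already been computed (for example by a CLIP image encoder) and are available as plain lists of floats. The function names `retrieve_references` and `country_accuracy`, the toy embeddings, and the choice of cosine similarity are illustrative assumptions rather than details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_references(query_embedding, reference_bank, k=3):
    """Return the k reference entries most similar to the query.

    reference_bank is a list of (embedding, country) pairs. The embeddings
    are assumed to be precomputed; this sketch does not call any encoder.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), country)
        for embedding, country in reference_bank
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def country_accuracy(predictions, labels):
    """Fraction of predictions matching the true country (case-insensitive)."""
    if not labels:
        return 0.0
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, labels)
    )
    return correct / len(labels)


# Toy two-dimensional embeddings standing in for real image features.
bank = [
    ([1.0, 0.0], "France"),
    ([0.0, 1.0], "Japan"),
    ([0.9, 0.1], "Belgium"),
]
top = retrieve_references([1.0, 0.0], bank, k=2)
```

In a setup like this, the retrieved countries and their images could be placed in the prompt as reference examples before asking the model for a country-level prediction, and `country_accuracy` would then score the model's answers against the ground-truth labels.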
