LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, Xingquan Zhu

TL;DR

This work tackles image geolocation in-the-wild using large multimodal language models (LMMs) by introducing a Street View–based dataset and a comprehensive benchmark that covers both training-free and training-based settings across closed-source and open-source LMMs. It combines CLIP-based dynamic retrieval, multiple prompting strategies, and fine-tuning to evaluate country-level localization performance. Key findings show that closed-source models like GPT-4V and Gemini achieve strong geolocation capability, that open-source models can approach these results with fine-tuning, and that incorporating reference images substantially boosts some models. The study highlights the potential and limitations of LMMs for geolocation and outlines avenues for finer-grained, city- or coordinate-level localization in future work.

Abstract

Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images from various countries via Google Street View. Then, we conduct training-free and training-based evaluations on closed-source and open-source multimodal language models. Our findings indicate that closed-source models demonstrate superior geolocation abilities, while open-source models can achieve comparable performance through fine-tuning.

Paper Structure (10 sections, 3 figures, 5 tables)

This paper contains 10 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Image samples from the test set.
  • Figure 2: The five images used as fixed input, in a fixed order, for the static few-shot strategy.
  • Figure 3: Image samples for the dynamic few-shot strategy. The first image in each row is the target image, whose location the LLM must guess; the following images in the same row are its five most similar images based on CLIP embeddings, ordered by Euclidean distance descending.
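
The retrieval step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the CLIP embeddings have already been computed and are available as plain lists of floats, and the function name `top_k_references` and its parameters are hypothetical. Reading "ordered by Euclidean distance descending" as "the five nearest images, listed farthest first" is also an assumption; pass `descending=False` for nearest-first order.

```python
import math


def top_k_references(target_emb, ref_embs, k=5, descending=True):
    """Pick the k reference embeddings closest to the target embedding.

    target_emb: embedding of the target image (sequence of floats).
    ref_embs:   embeddings of candidate reference images.
    Returns a list of (reference_index, distance) pairs.
    """
    # Euclidean distance from the target to every reference embedding.
    dists = [math.dist(target_emb, ref) for ref in ref_embs]
    # Indices of the k nearest references, nearest first.
    nearest = sorted(range(len(ref_embs)), key=lambda i: dists[i])[:k]
    # Assumed reading of the caption: list the selected images by
    # descending distance (farthest of the k first).
    if descending:
        nearest.reverse()
    return [(i, dists[i]) for i in nearest]


# Toy 2-D vectors standing in for real CLIP embeddings.
refs = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
        [10.0, 0.0], [20.0, 0.0], [0.5, 0.0]]
for index, distance in top_k_references([0.0, 0.0], refs):
    print(index, distance)
```

In a dynamic few-shot prompt, the images at the returned indices would be attached, together with their known countries, ahead of the target image.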