GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Sharon Li
TL;DR
GeoArena introduces an open, dynamic benchmark for evaluating large vision-language models on worldwide image geolocalization using in-the-wild user submissions and pairwise human judgments. It addresses the data leakage and privacy concerns inherent to static GPS-based benchmarks by replacing ground-truth coordinates with human preferences. The framework uses a Bradley-Terry model to estimate latent model strengths from pairwise votes and converts them into Elo-style scores, with bootstrap confidence intervals to quantify uncertainty. GeoArena-1K provides a dataset of prompts, images, model outputs, and voting outcomes to support research in reward modeling and geographic foundation models. Empirical results show frontier models (Gemini 2.5 Pro and Flash) leading the leaderboard and open-source variants narrowing the gap; the findings also highlight the importance of response reasoning style for user satisfaction.
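The ranking pipeline summarized above (Bradley-Terry strengths, Elo-style scores, bootstrap confidence intervals) can be illustrated with a minimal sketch. This is not the GeoArena implementation: the function names, the iterative (minorization-maximization) fitting procedure, the scale of 400, the mean rating of 1000, and the toy votes are all assumptions chosen for illustration, and ties are ignored.

```python
import math
import random


def fit_bradley_terry(votes, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses a simple iterative (MM-style) update; an assumption for this
    sketch, not necessarily how the platform fits its model.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = {m: 0.0 for m in models}
    games = {}  # unordered model pair -> number of comparisons
    for winner, loser in votes:
        wins[winner] += 1
        key = tuple(sorted((winner, loser)))
        games[key] = games.get(key, 0) + 1

    p = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        denom = {m: 0.0 for m in models}
        for (a, b), n in games.items():
            d = n / (p[a] + p[b])
            denom[a] += d
            denom[b] += d
        # Clamp so a model with zero wins keeps a tiny positive strength.
        new_p = {m: max(wins[m] / denom[m], 1e-12) for m in models}
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p


def to_elo(strengths, scale=400.0, anchor=1000.0):
    """Map strengths to Elo-style scores, shifted so the mean equals `anchor`."""
    raw = {m: scale * math.log10(s) for m, s in strengths.items()}
    shift = anchor - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}


def bootstrap_elo_ci(votes, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap intervals: resample votes, refit, collect scores."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_boot):
        resample = rng.choices(votes, k=len(votes))
        for m, r in to_elo(fit_bradley_terry(resample)).items():
            samples.setdefault(m, []).append(r)
    ci = {}
    for m, scores in samples.items():
        scores.sort()
        lo = scores[int((alpha / 2) * (len(scores) - 1))]
        hi = scores[int((1 - alpha / 2) * (len(scores) - 1))]
        ci[m] = (lo, hi)
    return ci


# Toy votes with hypothetical model names: (winner, loser).
votes = (
    [("model_a", "model_b")] * 8 + [("model_b", "model_a")] * 2
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
)
elo = to_elo(fit_bradley_terry(votes))
intervals = bootstrap_elo_ci(votes)
for name in sorted(elo, key=elo.get, reverse=True):
    low, high = intervals[name]
    print(f"{name}: {elo[name]:.0f} (95% CI {low:.0f} to {high:.0f})")
```

With this mapping, a score gap of 400 points corresponds to 10-to-1 odds of winning a pairwise comparison, the usual Elo convention. Wide bootstrap intervals indicate that a model has received too few votes for its rank to be trusted.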
Abstract
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, which compromises any assessment of a model's actual geolocalization capability. Second, existing metrics rely primarily on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task. GeoArena has been open-sourced to support future research.
