CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li
TL;DR
CityBench is an end-to-end, simulator-based benchmark for evaluating LLMs and VLMs on 8 urban tasks across 13 cities, built on CityData for data integration and CitySimu for simulating urban dynamics. It reveals that while models excel at commonsense and semantic urban understanding, they struggle with tasks that require professional knowledge and precise numerical reasoning, such as geospatial prediction and traffic control, and that they exhibit geospatial bias across cities. The study also shows that VLMs rely heavily on their LLM backbone, and that several error modes (e.g., misformatting, refusals, hallucinations) hinder practical deployment. Overall, CityBench highlights the need for urban-domain specialized models and scalable, global evaluation frameworks to advance AI for city-scale tasks.
Abstract
As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. There have been some early explorations of the usability of LLMs for a limited set of urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing such a benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios, and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator-based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs on diverse tasks in urban research. First, we build CityData to integrate diverse urban data and CitySimu to simulate fine-grained urban dynamics. Based on CityData and CitySimu, we design 8 representative urban tasks in 2 categories, perception-understanding and decision-making, which together form CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs can achieve competitive performance on diverse urban tasks requiring commonsense and semantic understanding abilities, e.g., understanding human dynamics and semantic inference from urban images. Meanwhile, they fail to solve challenging urban tasks requiring professional knowledge and high-level numerical abilities, e.g., geospatial prediction and traffic control tasks.
