CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li

TL;DR

CityBench provides an end-to-end, simulator-based benchmark to evaluate LLMs and VLMs on urban tasks across 13 cities and 8 tasks, harnessing CityData and CitySimu for data and dynamics. It reveals that while models excel at commonsense and semantic urban understanding, they struggle with professional knowledge and precise numerical tasks such as geospatial prediction and traffic control, and exhibit geospatial bias across cities. The study shows that VLMs rely heavily on the LLM backbone, and that many error modes (e.g., misformatting, refusals, hallucinations) hinder practical deployment. Overall, CityBench highlights the need for urban-domain specialized models and scalable, global evaluation frameworks to advance AI for city-scale tasks.

Abstract

As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. There have been some early explorations of the usability of LLMs for a limited set of urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing such a benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios, and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator-based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs on diverse tasks in urban research. First, we build CityData to integrate diverse urban data and CitySimu to simulate fine-grained urban dynamics. Based on CityData and CitySimu, we design 8 representative urban tasks in 2 categories, perception-understanding and decision-making, which together form CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs can achieve competitive performance in diverse urban tasks requiring commonsense and semantic understanding abilities, e.g., understanding human dynamics and semantic inference from urban images. Meanwhile, they fail to solve challenging urban tasks requiring professional knowledge and high-level numerical abilities, e.g., geospatial prediction and traffic control.

Paper Structure (28 sections, 7 figures, 6 tables)

This paper contains 28 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The framework of CityBench, which consists of a data collector CityData, an activity simulator CitySimu and 8 diverse urban tasks with different modalities. The evaluation data in the benchmark is collected from 13 cities around the world.
  • Figure 2: The pipeline for building the benchmark, including the data collection, data integration, evaluation generation, and quality control stages.
  • Figure 3: The simulation framework of CitySimu, including base environment APIs, interactive objects, simulation APIs and language APIs. The supported task examples also show how the simulation APIs relate to the evaluation tasks.
  • Figure 4: 8 tasks in CityBench with their metrics.
  • Figure 5: Detailed performance results of LLMs on two tasks: (top) mobility prediction and (bottom) image geolocalization. Both tasks are evaluated across multiple cities and multiple models. Performance varies significantly across urban contexts regardless of model architecture, highlighting how pervasive geospatial bias is in these models.
  • ...and 2 more figures
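
The cross-city variation described in the Figure 5 caption can be illustrated with a small sketch. This is not the paper's evaluation code: the record format, the function names `per_city_accuracy` and `accuracy_gap`, the model and city labels, and the use of a best-minus-worst gap as a summary are all assumptions made for illustration, and the records are made-up toy data.

```python
from collections import defaultdict
from statistics import mean


def per_city_accuracy(records):
    """Group (model, city, correct) records into {model: {city: accuracy}}.

    The record layout is hypothetical, chosen only for this sketch.
    """
    hits = defaultdict(lambda: defaultdict(list))
    for model, city, correct in records:
        hits[model][city].append(1.0 if correct else 0.0)
    return {
        model: {city: mean(values) for city, values in cities.items()}
        for model, cities in hits.items()
    }


def accuracy_gap(city_scores):
    """Difference between a model's best and worst per-city accuracy.

    A large gap is one simple (assumed) indicator of geospatial bias.
    """
    return max(city_scores.values()) - min(city_scores.values())


# Made-up toy records; labels are placeholders, not benchmark results.
records = [
    ("model_a", "CityX", True), ("model_a", "CityX", True),
    ("model_a", "CityY", True), ("model_a", "CityY", False),
    ("model_b", "CityX", True), ("model_b", "CityX", False),
    ("model_b", "CityY", False), ("model_b", "CityY", False),
]

scores = per_city_accuracy(records)
gaps = {model: accuracy_gap(s) for model, s in scores.items()}
```

In this toy data both models score lower on the second city than on the first, which is the kind of consistent per-city pattern, independent of the model, that the caption refers to as geospatial bias.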