USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents
Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, Hao Liu
TL;DR
USTBench is the first benchmark for evaluating the spatiotemporal reasoning of LLMs acting as urban agents. It pairs an interactive city environment (UAgentEnv) with a two-part evaluation: process-level QA over four reasoning abilities (understanding, forecasting, planning, and reflection with feedback) and end-to-end assessment on nine urban tasks. Across thirteen LLMs, the study finds that models handle spatiotemporal understanding and forecasting comparatively well but struggle with long-horizon planning and reflective adaptation, and that reasoning models trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs on urban tasks. These results point to the need for domain-specialized adaptation methods for LLM-based urban agents.
Abstract
Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite this promise, existing studies primarily evaluate urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To address this gap, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five urban decision-making tasks and four spatiotemporal prediction tasks, all running within our interactive city environment, UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation together with standardized end-to-end task assessments, enabling both fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we find that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation for building more adaptive and effective LLM-based urban agents and broader smart city applications.
