MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Zhiheng Song; Jingshuai Zhang; Chuan Qin; Chao Wang; Chao Chen; Longfei Xu; Kaikui Liu; Xiangxiang Chu; Hengshu Zhu

TL;DR

The findings reveal that current LLM-based route-planning agents perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications.

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .
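
The idea behind a deterministic API-replay sandbox can be illustrated with a minimal sketch. This is not the MobilityBench implementation: the class name `ReplaySandbox`, the helper `request_key`, the tool name `route_plan`, and the parameter and response fields are all hypothetical, chosen only to show how recorded responses might be served in place of live calls so that repeated runs see identical tool outputs.

```python
import hashlib
import json


def request_key(tool: str, params: dict) -> str:
    """Build an order-independent key for a tool call (hypothetical scheme)."""
    canonical = json.dumps(
        {"tool": tool, "params": params}, sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplaySandbox:
    """Serve pre-recorded API responses instead of calling a live service."""

    def __init__(self):
        self._cache = {}

    def record(self, tool: str, params: dict, response: dict) -> None:
        # Store the response captured once from the real service.
        self._cache[request_key(tool, params)] = response

    def call(self, tool: str, params: dict) -> dict:
        # Replay the stored response; fail loudly on an unrecorded request
        # rather than silently falling back to a live (non-deterministic) call.
        key = request_key(tool, params)
        if key not in self._cache:
            raise KeyError(f"no recorded response for {tool} with {params}")
        return self._cache[key]


# Hypothetical tool name and payload, for illustration only.
sandbox = ReplaySandbox()
sandbox.record("route_plan", {"origin": "A", "destination": "B"}, {"distance_m": 1200})

# The same logical request returns the same response regardless of key order.
first = sandbox.call("route_plan", {"destination": "B", "origin": "A"})
second = sandbox.call("route_plan", {"origin": "A", "destination": "B"})
```

Keying on a canonical serialization of the request is one plausible design choice; a real sandbox would also need a policy for requests that were never recorded, which this sketch handles by raising an error.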

Paper Structure (31 sections, 5 equations, 4 figures, 4 tables)

Introduction
Related Work
Route Planning in Urban Computing
Tool-augmented Agent Benchmark
MobilityBench
Benchmark Construction
Episode-centric Formulation
Data Collection and Task Taxonomy Construction
Ground-Truth Construction
Deterministic Replay Sandbox
Dataset Statistics
Evaluation Protocol
Instruction Understanding
Planning
Tool Use
...and 16 more sections

Figures (4)

Figure 1: Overview of MobilityBench, a systematic benchmark for evaluating route-planning agents.
Figure 2: Global coverage of MobilityBench Data.
Figure 3: Performance across four high-level task families.
Figure 4: Final pass rate comparison (Thinking vs. Non-thinking) under the Plan-and-Execute framework.
