CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
TL;DR
CitySeeker introduces the first large-scale benchmark for embodied urban navigation driven by implicit human needs, using 6,440 trajectories across 8 cities and 7 need categories to probe long-horizon reasoning and visually grounded search. The paper reveals significant gaps in current VLMs' ability to translate abstract intents into concrete, multi-step plans, and proposes three human-inspired strategies (Backtracking, Cognitive-map Enrichment, and Memory-Based Retrieval, collectively BCR) to boost spatial intelligence. Through evaluations of 27 models and analyses of error modes and city biases, the work highlights both the limits of current systems and the potential of memory and structured spatial cues to improve last-mile navigation. The findings offer a roadmap for building spatially aware, memory-informed VLMs that can robustly address implicit needs in dynamic urban environments.
Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities in embodied urban navigation driven by implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify three key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks further, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required to tackle "last-mile" navigation challenges.
