CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space
Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang
TL;DR
CityEQA tackles open-ended embodied question answering in urban environments, introducing CityEQA-EC with 1,412 tasks and a hierarchical PMA agent to handle long-horizon planning. PMA combines a Planner-Manager-Actor architecture, an object-centric cognitive map, GroundSAM grounding, and VLM-driven observation collection to perform landmark-based navigation, exploration, and information gathering. Experiments show PMA achieving 60.73% of human accuracy and outperforming several baselines, though a sizable gap to human performance remains, underscoring the need for improved urban visual reasoning. The work establishes a foundation for urban spatial intelligence and provides datasets and code to catalyze future research in outdoor embodied QA.
Abstract
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings (spanning environment, action, and perception) largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks the question-answering task down into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during process control, and specialized Actors handle the navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming competitive baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.
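The Planner-Manager-Actor hierarchy described above can be pictured with a minimal, purely illustrative Python sketch. It is not the released implementation: every name in it (SubTask, CognitiveMap, plan, manage, the three actor functions), the example question, and the coordinates are hypothetical, and the LLM/VLM components are replaced with fixed stubs. It only shows the control flow of a planner emitting sub-tasks, a manager dispatching them while keeping an object-centric map, and actors handling navigation, exploration, and collection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SubTask:
    kind: str    # "navigation", "exploration", or "collection"
    target: str  # landmark or object the sub-task refers to


@dataclass
class CognitiveMap:
    """Toy object-centric map: object name -> (x, y) position (assumed layout)."""
    objects: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def add(self, name: str, position: Tuple[float, float]) -> None:
        self.objects[name] = position

    def knows(self, name: str) -> bool:
        return name in self.objects


def plan(question: str) -> List[SubTask]:
    # Stub Planner: returns a fixed decomposition regardless of the question.
    # In the paper this step is performed by the Planner; here it is hard-coded.
    return [
        SubTask("navigation", "shopping mall"),
        SubTask("exploration", "car"),
        SubTask("collection", "car"),
    ]


def navigate(task: SubTask, cmap: CognitiveMap) -> str:
    cmap.add(task.target, (0.0, 0.0))  # pretend the landmark was reached and localized
    return f"reached {task.target}"


def explore(task: SubTask, cmap: CognitiveMap) -> str:
    cmap.add(task.target, (1.0, 2.0))  # pretend the target object was spotted
    return f"found {task.target}"


def collect(task: SubTask, cmap: CognitiveMap) -> str:
    if not cmap.knows(task.target):
        return f"cannot observe {task.target}: not in map"
    return f"observed {task.target} at {cmap.objects[task.target]}"


ACTORS = {"navigation": navigate, "exploration": explore, "collection": collect}


def manage(sub_tasks: List[SubTask]) -> Tuple[CognitiveMap, List[str]]:
    # Stub Manager: runs sub-tasks in order, sharing one cognitive map across actors.
    cmap = CognitiveMap()
    log = [ACTORS[task.kind](task, cmap) for task in sub_tasks]
    return cmap, log


tasks = plan("What color is the car parked next to the shopping mall?")
cmap, log = manage(tasks)
print(log)
```

The shared map is the point of the design in this sketch: the collection actor can only gather an observation for an object that an earlier navigation or exploration step has already placed in the map.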
