CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

TL;DR

CityEQA tackles open-ended embodied question answering in urban environments, introducing CityEQA-EC with 1,412 tasks and a hierarchical PMA agent to handle long-horizon planning. PMA leverages a Planner-Manager-Actor architecture, an object-centric cognitive map, GroundSAM grounding, and VLM-driven observation collection to perform landmark-based navigation, exploration, and information gathering. Experiments show PMA achieving 60.73% of human accuracy and outperforming several baselines, though a sizable gap to human performance remains, underscoring the need for improved urban visual reasoning. The work establishes a foundation for urban spatial intelligence and provides datasets and code to catalyze future research in outdoor embodied QA.

Abstract

Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings, which span environment, action, and perception, largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks the question-answering task down into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming competitive baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.
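
The Planner-Manager-Actor division of labor described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: every name here (`SubTask`, `CognitiveMap`, `plan`, `run`, and the three actor functions) is hypothetical, the Planner is replaced by a fixed three-step decomposition rather than an LLM, and the Actors return canned strings instead of acting in a simulator or calling a VLM.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    kind: str    # "navigate", "explore", or "collect"
    target: str  # landmark or object the sub-task refers to


@dataclass
class CognitiveMap:
    """Toy object-centric map (hypothetical): object name -> (x, y)."""
    objects: dict = field(default_factory=dict)

    def add(self, name, position):
        self.objects[name] = position


def plan(question, landmark, target):
    """Stand-in for the Planner: a fixed decomposition, no LLM involved."""
    return [
        SubTask("navigate", landmark),
        SubTask("explore", target),
        SubTask("collect", target),
    ]


def navigate(task, cmap, state):
    # Move to the landmark if the map knows it; otherwise stay put.
    state["position"] = cmap.objects.get(task.target, state["position"])
    return f"arrived near {task.target}"


def explore(task, cmap, state):
    # Assumption for illustration: the target is found one unit east.
    x, y = state["position"]
    cmap.add(task.target, (x + 1, y))
    return f"located {task.target}"


def collect(task, cmap, state):
    # A real collector would gather observations for a VLM to answer from.
    return f"observed {task.target} at {cmap.objects[task.target]}"


ACTORS = {"navigate": navigate, "explore": explore, "collect": collect}


def run(question, landmark, target, cmap):
    """Stand-in for the Manager: dispatch sub-tasks in order, log results."""
    state = {"position": (0, 0)}
    log = []
    for task in plan(question, landmark, target):
        log.append(ACTORS[task.kind](task, cmap, state))
    return log
```

For example, after seeding the map with `cmap.add("bank", (5, 3))`, calling `run("What color is the car east of the bank?", "bank", "car", cmap)` dispatches one navigation, one exploration, and one collection sub-task in sequence, and the explorer records the car in the shared map for the collector to use.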

Paper Structure

This paper contains 44 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: The typical workflow of PMA when addressing CityEQA tasks. There are two cars in this area, so a valid question must contain landmarks and spatial relationships to specify which car is meant. Given the task, PMA sequentially completes multiple sub-tasks to find the answer.
  • Figure 2: Task examples and dataset statistics of the CityEQA-EC.
  • Figure 3: The overview of our proposed PMA agent.
  • Figure 4: The performance of the Collector module at different steps.
  • Figure 5: The collection and validation process of the CityEQA dataset.
  • ...and 8 more figures