Task-oriented Sequential Grounding and Navigation in 3D Scenes
Zhuofan Zhang, Ziyu Zhu, Junhao Li, Pengxiang Li, Tianxu Wang, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
TL;DR
This work introduces Task-oriented Sequential Grounding and Navigation in 3D Scenes, a task that targets the dynamic, multi-step nature of real-world daily activities in indoor environments. It pairs a large-scale dataset (SG3D) with a sequential grounding model (SG-LLM) to study grounding and navigation across ordered task steps, and its benchmarks show that current models struggle to capture cross-step context. SG-LLM, which uses a sequential adapter and stepwise grounding, achieves state-of-the-art performance on sequential grounding after fine-tuning, though task-level accuracy remains below 40%, leaving substantial room for improvement. The paper also compares modular and end-to-end navigation strategies and reports that end-to-end policies benefit most from sequential context and fine-tuning. Together, SG3D and SG-LLM move the field toward context-aware embodied agents that can plan and act across extended task sequences in 3D environments.
Abstract
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing the significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach leveraging a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to advance the development of more capable and context-aware embodied agents.
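To make the distinction between per-step and per-task performance concrete, the following is a minimal sketch of how a sequential grounding task and its two accuracy measures might be represented. The class names, field names, and example data are hypothetical and do not come from the SG3D release; the sketch also assumes that a task counts as correct only when every one of its steps is grounded to the right object.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str          # natural-language instruction for this step
    target_id: int     # ground-truth object ID in the scene (hypothetical field)
    predicted_id: int  # object ID selected by the model (hypothetical field)


@dataclass
class Task:
    description: str
    steps: List[Step]


def step_accuracy(tasks: List[Task]) -> float:
    """Fraction of individual steps grounded to the correct object."""
    steps = [s for t in tasks for s in t.steps]
    return sum(s.predicted_id == s.target_id for s in steps) / len(steps)


def task_accuracy(tasks: List[Task]) -> float:
    """Fraction of tasks in which every step is grounded correctly (assumed definition)."""
    solved = sum(all(s.predicted_id == s.target_id for s in t.steps) for t in tasks)
    return solved / len(tasks)


# Made-up example: the first task is fully correct, the second has one wrong step.
example_tasks = [
    Task("Make coffee", [
        Step("Walk to the kitchen counter.", 3, 3),
        Step("Pick up the mug next to it.", 7, 7),
        Step("Place it under the coffee machine.", 9, 9),
    ]),
    Task("Water the plant", [
        Step("Fill the watering can at the sink.", 2, 2),
        Step("Water the plant by the window.", 5, 6),
    ]),
]

print(step_accuracy(example_tasks))  # 4 of 5 steps correct
print(task_accuracy(example_tasks))  # 1 of 2 tasks fully correct
```

Under this assumed definition a single wrong step fails the whole task, so task-level accuracy can stay low even when most individual steps are grounded correctly.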
