Task-oriented Sequential Grounding and Navigation in 3D Scenes

Zhuofan Zhang, Ziyu Zhu, Junhao Li, Pengxiang Li, Tianxu Wang, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li

TL;DR

This work introduces Task-oriented Sequential Grounding and Navigation in 3D Scenes (SG3D), a benchmark that targets the dynamic, multi-step nature of daily tasks in indoor environments. It pairs the large-scale SG3D dataset with a sequential grounding model (SG-LLM) to study grounding and navigation across ordered task steps, and its benchmarks show that current models struggle to capture cross-step context. SG-LLM, which uses a sequential adapter and stepwise grounding, achieves state-of-the-art performance on sequential grounding after fine-tuning, yet its task-level accuracy remains below 40%, leaving substantial room for improvement. The paper also compares modular and end-to-end navigation strategies and finds that end-to-end policies benefit most from sequential context and fine-tuning. Together, SG3D and SG-LLM push the field toward context-aware embodied agents that can plan and act across extended task sequences in 3D environments.

Abstract

Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing the significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach leveraging a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to advance the development of more capable and context-aware embodied agents.

Paper Structure (29 sections, 4 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: The task-oriented sequential grounding and navigation task in 3D scenes (SG3D), wherein models are required to interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To solve this task, models must understand each step in the sequential context to identify the target object, since a single step alone can be insufficient to distinguish the target from other objects of the same category.
  • Figure 2: The comparison between task-oriented steps in SG3D (first row) and object-centric referrals in ScanRefer (second row) for the same target objects.
  • Figure 3: Data collection pipeline.
  • Figure 4: Distributions of (a) text length (by words) per task, and (b) the number of steps per task.
  • Figure 5: The structure of SG-LLM.
  • ...and 11 more figures
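
The TL;DR reports a "task-level accuracy" alongside per-step grounding, and Figure 1 notes that each step must be resolved in its sequential context. As a rough illustration of how the two granularities can differ, the sketch below scores predictions per step and per task. It is not the paper's evaluation code: the names `Step`, `Task`, `target_id`, and `step_and_task_accuracy` are hypothetical, and the rule that a task counts as correct only when every one of its steps is grounded correctly is an assumption that this excerpt does not spell out.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Step:
    text: str        # instruction for this step
    target_id: int   # ground-truth object id in the scene (hypothetical field)


@dataclass
class Task:
    description: str
    steps: List[Step]


def step_and_task_accuracy(
    tasks: Sequence[Task], predictions: Sequence[Sequence[int]]
) -> Tuple[float, float]:
    """Return (step accuracy, task accuracy).

    predictions[i][j] is the predicted object id for step j of task i.
    """
    if len(tasks) != len(predictions) or not tasks:
        raise ValueError("need one non-empty prediction list per task")

    total_steps = correct_steps = correct_tasks = 0
    for task, preds in zip(tasks, predictions):
        if len(preds) != len(task.steps):
            raise ValueError("one prediction is required per step")
        hits = [pred == step.target_id for step, pred in zip(task.steps, preds)]
        total_steps += len(hits)
        correct_steps += sum(hits)
        # Assumed definition: a task is correct only if all of its steps are.
        correct_tasks += int(all(hits))

    return correct_steps / total_steps, correct_tasks / len(tasks)


# Toy data, invented purely for illustration.
tasks = [
    Task("Make coffee", [Step("Go to the counter", 4),
                         Step("Pick up the mug", 7),
                         Step("Return to the counter", 4)]),
    Task("Water the plant", [Step("Fill the cup at the sink", 1),
                             Step("Pour it into the plant pot", 2)]),
]
predictions = [[4, 7, 4], [1, 3]]
step_acc, task_acc = step_and_task_accuracy(tasks, predictions)
print(step_acc, task_acc)
```

Under this assumed definition a single wrong step fails the whole task, so task-level accuracy can sit well below step-level accuracy. In the toy data, four of five steps are correct but only one of two tasks is.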