Task-oriented Sequential Grounding and Navigation in 3D Scenes

Zhuofan Zhang; Ziyu Zhu; Junhao Li; Pengxiang Li; Tianxu Wang; Tengyu Liu; Xiaojian Ma; Yixin Chen; Baoxiong Jia; Siyuan Huang; Qing Li

TL;DR

This work introduces Task-oriented Sequential Grounding and Navigation in 3D Scenes, a task and benchmark that targets the multi-step nature of everyday activities in indoor environments. It pairs a large-scale dataset (SG3D) with a sequential grounding model (SG-LLM) to evaluate grounding and navigation across ordered task steps, and shows that current models struggle to capture cross-step context. SG-LLM, which combines a sequential adapter with stepwise grounding, achieves state-of-the-art performance on sequential grounding after fine-tuning, yet its task-level accuracy remains below 40%, leaving substantial room for improvement. The paper also compares modular and end-to-end navigation strategies and finds that end-to-end policies benefit most from sequential context and fine-tuning. Together, SG3D and SG-LLM push the field toward context-aware embodied agents that can plan and act across extended task sequences in 3D environments.

Abstract

Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing the significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach leveraging a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to advance the development of more capable and context-aware embodied agents.
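To make the sequential grounding setup concrete, the following is a minimal Python sketch of how a task with ordered steps might be represented and scored. Everything in it is an assumption for illustration: the `Task` container, its field names, the example tasks, and the object ids are hypothetical, and the definition of task-level accuracy used here (a task counts as solved only when every one of its steps is grounded correctly) is assumed rather than taken from this summary.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    """Hypothetical container for one task; field names are illustrative."""
    description: str
    steps: List[str]        # ordered step instructions
    target_ids: List[int]   # ground-truth object id for each step


def step_accuracy(tasks: Sequence[Task], predictions: Sequence[Sequence[int]]) -> float:
    """Fraction of individual steps whose predicted object id is correct."""
    correct = 0
    total = 0
    for task, pred in zip(tasks, predictions):
        for gt, p in zip(task.target_ids, pred):
            correct += int(gt == p)
            total += 1
    return correct / total if total else 0.0


def task_accuracy(tasks: Sequence[Task], predictions: Sequence[Sequence[int]]) -> float:
    """Fraction of tasks with every step correct (assumed definition)."""
    if not tasks:
        return 0.0
    solved = sum(
        1 for task, pred in zip(tasks, predictions)
        if list(pred) == task.target_ids
    )
    return solved / len(tasks)


# Made-up example data: two tasks with two steps each.
tasks = [
    Task("Make coffee",
         ["Go to the coffee machine.", "Take a cup from the shelf next to it."],
         [12, 7]),
    Task("Wash hands",
         ["Walk to the sink.", "Dry your hands with the towel beside it."],
         [3, 9]),
]
predictions = [[12, 7], [3, 4]]  # the second task misses its last step

s_acc = step_accuracy(tasks, predictions)
t_acc = task_accuracy(tasks, predictions)
print(f"step accuracy = {s_acc:.2f}, task accuracy = {t_acc:.2f}")
```

Under this assumed definition, a single wrong step fails the whole task, so task-level accuracy can sit well below step-level accuracy. Note also that steps such as "the shelf next to it" only resolve given the earlier steps, which is why a model has to carry context across the sequence.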

Paper Structure (29 sections, 4 equations, 16 figures, 5 tables)

Introduction
Related Work
3D Sequential Grounding and Navigation
Problem Formulation
Dataset Construction
Dataset Analysis
Sequential Grounding Methods
Baselines
SG-LLM
Sequential Navigation Methods
Experiments and Results
Evaluation Metrics
Quantitative Results & Analysis
Results on Sequential Grounding Benchmark
Results on Sequential Navigation Benchmark
...and 14 more sections

Figures (16)

Figure 1: The task-oriented sequential grounding and navigation task in 3D scenes (SG3D), wherein models are required to interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To solve this task, models must understand each step in the sequential context to identify the target object, since a single step alone can be insufficient to distinguish the target from other objects of the same category.
Figure 2: The comparison between task-oriented steps in SG3D (first row) and object-centric referrals in ScanRefer (second row) for the same target objects.
Figure 3: Data collection pipeline.
Figure 4: Distributions of (a) text length (by words) per task, and (b) the number of steps per task.
Figure 5: The structure of SG-LLM.
...and 11 more figures
