ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang

TL;DR

ASHiTA grounds high-level natural-language tasks to 3D scenes by coupling hierarchical task analysis (HTA) with a task-driven 3D scene graph. It introduces the Hierarchical Information Bottleneck (H-IB) to produce multi-resolution scene representations that align with a task hierarchy, and it alternates bottom-up scene grounding with top-down HTA refinement. The framework outperforms LLM baselines at breaking high-level tasks into environment-dependent subtasks and achieves grounding performance comparable to state-of-the-art methods, supported by real-world robot demonstrations and ablations. The approach advances embodied planning by integrating language, tasks, and structured scene understanding.

Abstract

While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.

Paper Structure

This paper contains 21 sections, 48 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: Given the high-level task of "Prepare for dinner", ASHiTA automatically generates a hierarchy of subtasks (and items) while grounding them to a 3D scene graph. For the scene graph in the figure, the blue node corresponds to the high-level task, magenta nodes correspond to the subtasks, and red nodes correspond to the items required by the subtasks.
  • Figure 2: ASHiTA first segments and encodes primitives in 2D, and then associates and optimizes them in 3D together with the camera poses. ASHiTA then breaks down high-level tasks into a task hierarchy by alternating two steps: a Scene Hierarchy Update (Section \ref{sec:scene_hierarchy}) which creates a 3D scene graph from the primitives layer using the task hierarchy, and a Task Update (Section \ref{sec:task_update}) which uses an LLM and the 3D scene graph to refine the task hierarchy.
  • Figure 3: ASHiTA's Scene Hierarchy and Task Update steps. The task hierarchy is on the left with diamond-shaped nodes representing the task entities. The scene graph is on the right, with circles marking the task-aligned scene graph nodes and green boxes marking the primitives layer. (a) Bottom-Up Construction: Starting from an initial task hierarchy, we perform H-IB and use the result to construct a 3D scene graph. (b) Top-Down Pruning: We perform pruning using the probabilities obtained from H-IB and also prune nodes related to the null tasks. (c) Spatial Update: Using the scene graph, we can update the spatial locations of the tasks, subtasks, and items. (d) Hierarchy Refinement: With the suggested items given by H-IB, we query the LLM to refine the task hierarchy.
  • Figure 4: ASHiTA demonstrated in a real-world seminar room and snack bar on a robot, given two high-level tasks. Blue denotes the high-level tasks, magenta the decomposed subtasks, and red the items.
  • Figure 5: ASHiTA demonstrated in a real-world hardware workshop environment on a robot, given four high-level tasks.
  • ...and 7 more figures
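
The three-level hierarchy described in the Figure 1 caption (high-level task, subtasks, items, each grounded to a scene graph node) can be pictured with a small data structure. The following is only an illustrative sketch and not the authors' implementation: the class names `Task`, `Subtask`, and `Item`, the integer `scene_node` ids, and the example subtasks and items are all assumptions made for this example. Only the task "Prepare for dinner" comes from the caption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Item:
    """Leaf of the hierarchy: an object required by a subtask."""
    name: str
    # Hypothetical scene graph node id; None means the item is not grounded yet.
    scene_node: Optional[int] = None


@dataclass
class Subtask:
    """Middle level: a concrete step of the high-level task."""
    description: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Task:
    """Root of the hierarchy: the high-level instruction."""
    description: str
    subtasks: List[Subtask] = field(default_factory=list)

    def grounded_items(self) -> Iterator[Tuple[str, str, int]]:
        """Yield (subtask, item, scene_node) for every grounded item."""
        for sub in self.subtasks:
            for item in sub.items:
                if item.scene_node is not None:
                    yield sub.description, item.name, item.scene_node


# Invented example content; node ids are placeholders, not real scene data.
dinner = Task(
    "Prepare for dinner",
    [
        Subtask("Set the table", [Item("plate", 12), Item("fork", 13)]),
        Subtask("Pour drinks", [Item("glass", 27), Item("pitcher")]),
    ],
)
grounded = list(dinner.grounded_items())
```

In this sketch an ungrounded item (here `pitcher`) simply carries no node id, which loosely mirrors the idea that a task hierarchy may name items the scene does not yet support and therefore needs refinement.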