ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang
TL;DR
ASHiTA tackles grounding high-level natural-language tasks to embodied 3D scenes by coupling hierarchical task analysis (HTA) with a task-driven 3D scene graph. It introduces the Hierarchical Information Bottleneck (H-IB) to produce multi-resolution scene representations that align with a task hierarchy, and it alternates bottom-up scene grounding with top-down HTA refinement. The framework outperforms LLM baselines at breaking high-level tasks into environment-dependent subtasks and achieves grounding performance comparable to state-of-the-art methods, supported by real-world demonstrations and ablations. This approach advances embodied planning by tightly integrating language, tasks, and structured scene understanding in open-ended environments.
Abstract
While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks, a process called hierarchical task analysis, is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, to generate the task breakdown, with task-driven 3D scene graph construction to generate a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
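To make the notion of a "task hierarchy grounded to a 3D scene graph" concrete, the following is a minimal Python sketch of one possible data structure. It is an illustration only and not ASHiTA's actual representation: the `TaskNode` class, the `grounded_to` field, the helper functions, and the scene-graph node IDs such as `sink_1` are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One task in a hierarchy (hypothetical structure, for illustration)."""
    description: str
    subtasks: list[TaskNode] = field(default_factory=list)
    # IDs of scene-graph nodes (objects or places) this task is grounded to.
    grounded_to: list[str] = field(default_factory=list)


def leaf_groundings(task: TaskNode) -> list[tuple[str, list[str]]]:
    """Collect (description, scene node IDs) for every leaf subtask, depth-first."""
    if not task.subtasks:
        return [(task.description, task.grounded_to)]
    leaves = []
    for sub in task.subtasks:
        leaves.extend(leaf_groundings(sub))
    return leaves


def ungrounded_leaves(task: TaskNode) -> list[str]:
    """Return leaf subtasks that reference no scene-graph node yet."""
    return [desc for desc, nodes in leaf_groundings(task) if not nodes]


# A toy high-level task broken into subtasks; node IDs are made up.
example_task = TaskNode(
    description="Clean the kitchen",
    subtasks=[
        TaskNode("Wash the dishes", grounded_to=["sink_1", "plate_2"]),
        TaskNode("Wipe the counter", grounded_to=["counter_1", "sponge_3"]),
        TaskNode("Take out the trash"),  # not grounded to any scene node yet
    ],
)

for description, nodes in leaf_groundings(example_task):
    print(description, "->", nodes)
```

In this toy example the high-level task "Clean the kitchen" names no object in the scene; only its subtasks do. A leaf reported by `ungrounded_leaves` marks a subtask with no matching scene element, which is the kind of case where the task breakdown or the scene representation would need to be revised.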
