HELIOS: Hierarchical Exploration for Language-Grounded Interaction in Open Scenes

Katrina Ashton, Chahyon Ku, Shrey Shah, Saumit Vedula, Tingrui Zhang, Wen Jiang, Kostas Daniilidis, Bernadette Bucher

Abstract

Language-specified mobile manipulation tasks in novel environments pose three simultaneous challenges: interacting with a scene that is only partially observed, grounding semantic information from language instructions in that partially observed scene, and actively updating knowledge of the scene with new observations. To address these challenges, we propose HELIOS, a hierarchical scene representation and associated search objective. We construct 2D maps containing the semantic and occupancy information relevant for navigation while actively constructing 3D Gaussian representations of task-relevant objects. We fuse observations across this multi-layered representation while explicitly modeling the multi-view consistency of the detections of each object using the Dirichlet distribution. Planning is formulated as a search problem over our hierarchical representation. We formulate an objective that jointly considers (i) exploration of unobserved or uncertain regions of the environment and (ii) information gathering from additional observations of candidate objects. This objective integrates frontier-based exploration with the expected information gain associated with improving the semantic consistency of object detections. We evaluate HELIOS on the OVMM benchmark in the Habitat simulator, a pick-and-place benchmark in which perception is challenging because the scenes are large and complex and the target objects are comparatively small. HELIOS achieves state-of-the-art results on OVMM. We also demonstrate HELIOS performing language-specified pick and place in a real-world office environment on a Spot robot. Our method leverages pretrained VLMs to achieve these results in simulation and the real world without any task-specific training.

Paper Structure

This paper contains 18 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Our hierarchical scene representation.
  • Figure 2: Example of multi-view fusion. We show two observations. In the first, a toy rocket is incorrectly identified as a knife and the table is correctly identified; in the second, the table is again correctly identified. To the right, we show the change in the semantic probability of each class in the 3DGS part of our scene representation when it is updated with the second detection. The incorrect detection of the object on the table as a knife is not multi-view consistent, so the probability of this object being a knife goes down when the second detection is included. The table is correctly detected across multiple frames, so its probability goes up after fusion.
  • Figure 3: Method flow chart for HELIOS.
  • Figure 4: Hardware experiments set-up.
  • Figure 5: Hardware results. Success rates of subtask performance for HELIOS and the trusting agent baseline, shown as stacked bar plots. The lowest bar in each column is the rate of successfully placing the object, which is the overall task success rate; the other bars show the success rates of the earlier subtasks.
  • ...and 5 more figures
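
The multi-view fusion behavior described in the Figure 2 caption can be illustrated with a small Dirichlet sketch. This is not the paper's update rule, which is not given on this page. It is a minimal example under stated assumptions: each object keeps one Dirichlet concentration vector over classes, each detection adds its normalized class scores as pseudo-counts, and the class probability is the Dirichlet mean. The label set, the prior value, the score vectors, and the function names (`init_alpha`, `fuse_detection`, `expected_probs`) are all hypothetical.

```python
import numpy as np

# Hypothetical label set, chosen only to mirror the Figure 2 example.
CLASSES = ["knife", "table", "other"]


def init_alpha(num_classes, prior=1.0):
    """Uniform Dirichlet prior (assumed value) over the class labels."""
    return np.full(num_classes, prior, dtype=float)


def fuse_detection(alpha, class_scores):
    """Add one detection's normalized class scores as pseudo-counts.

    Assumption: a detection contributes one unit of evidence in total,
    split across classes in proportion to its scores.
    """
    scores = np.asarray(class_scores, dtype=float)
    return alpha + scores / scores.sum()


def expected_probs(alpha):
    """Mean of the Dirichlet distribution: alpha_k / sum(alpha)."""
    return alpha / alpha.sum()


# Object on the table: view 1 says "knife", view 2 disagrees.
obj = fuse_detection(init_alpha(len(CLASSES)), [0.8, 0.1, 0.1])
obj_p1 = expected_probs(obj)
obj = fuse_detection(obj, [0.1, 0.1, 0.8])
obj_p2 = expected_probs(obj)

# Table: both views agree on "table".
tab = fuse_detection(init_alpha(len(CLASSES)), [0.1, 0.8, 0.1])
tab_p1 = expected_probs(tab)
tab = fuse_detection(tab, [0.1, 0.8, 0.1])
tab_p2 = expected_probs(tab)

print("P(knife) for object:", obj_p1[0], "->", obj_p2[0])
print("P(table) for table: ", tab_p1[1], "->", tab_p2[1])
```

With these made-up scores, the inconsistent "knife" label loses probability after the second view while the consistent "table" label gains it, which is the qualitative effect the caption describes.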