GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

Kyle B. Hatch; Ashwin Balakrishna; Oier Mees; Suraj Nair; Seohong Park; Blake Wulfe; Masha Itkina; Benjamin Eysenbach; Sergey Levine; Thomas Kollar; Benjamin Burchfiel

TL;DR

GHIL-Glue filters out subgoals that do not lead to task progress and makes goal-conditioned policies more robust to generated subgoals with harmful visual artifacts. It achieves a new state of the art on the CALVIN simulation benchmark among policies that use observations from a single RGB camera.

Abstract

Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
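
The abstract describes the control loop only at a high level: a generative model proposes subgoals, unhelpful ones are filtered out, and a goal-conditioned policy acts toward the one that remains. The sketch below is one way such a loop could be wired together, not the authors' implementation. Every name in it (`propose`, `progress_score`, `act`, `num_candidates`) is hypothetical, and the abstract does not say how task progress is actually estimated; here it is assumed to be an arbitrary scoring function supplied by the caller.

```python
from typing import Any, Callable, Sequence


def select_subgoal(candidates: Sequence[Any],
                   score: Callable[[Any], float]) -> Any:
    """Return the candidate subgoal with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate subgoal")
    return max(candidates, key=score)


def hierarchical_step(observation: Any,
                      instruction: str,
                      propose: Callable[[Any, str], Any],
                      progress_score: Callable[[Any, Any, str], float],
                      act: Callable[[Any, Any], Any],
                      num_candidates: int = 4) -> Any:
    """One high-level step: sample subgoals, filter them, then act.

    All three callables are hypothetical stand-ins:
      propose        -- a language-conditioned image/video generator
      progress_score -- an estimate of how much a subgoal advances the task
      act            -- a low-level goal-conditioned policy
    """
    # Sample several candidate subgoals from the generative model.
    candidates = [propose(observation, instruction)
                  for _ in range(num_candidates)]
    # Keep only the candidate judged most likely to make task progress.
    subgoal = select_subgoal(
        candidates,
        lambda g: progress_score(observation, g, instruction),
    )
    # Hand the surviving subgoal to the low-level policy.
    return act(observation, subgoal)
```

In a real system the three callables would be learned models operating on images; the structure of the loop is the only thing this sketch is meant to convey.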
