SEAL: SEmantic-Augmented Imitation Learning via Language Model

Chengyang Gu, Yuxin Pan, Haotian Bai, Hui Xiong, Yize Chen

TL;DR

SEAL addresses long-horizon hierarchical imitation learning by using pretrained LLMs to define semantically meaningful sub-goals and pre-label states, eliminating the need for task-specific hierarchies. It introduces a dual-encoder Sub-goal Learner (LLM-supervised and unsupervised VQ) and a transition-augmented low-level policy to emphasize critical sub-goal transitions during imitation. End-to-end training optimizes a weighted combination of high- and low-level objectives, with dynamic confidences guiding the integration of the encoders. Empirical results on KeyDoor and Grid-World show SEAL outperforming BC, LISA, SDIL, and Thought Cloning, especially in low-data and longer-horizon scenarios, demonstrating robust sub-goal discovery, better transition handling, and improved generalization to task variations.
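One way to read "emphasize critical sub-goal transitions during imitation" is as a per-step weight on a behaviour-cloning loss. The sketch below is only an illustration under that assumption: the function names, the rule that a step counts as a transition when the next step's sub-goal label differs, and the `transition_weight` parameter are all hypothetical and are not taken from the paper.

```python
import math

def transition_flags(subgoals):
    """Flag step t when the sub-goal label changes between t and t+1 (assumed rule)."""
    flags = [False] * len(subgoals)
    for t in range(len(subgoals) - 1):
        if subgoals[t] != subgoals[t + 1]:
            flags[t] = True
    return flags

def weighted_bc_loss(action_probs, expert_actions, subgoals, transition_weight=2.0):
    """Weighted mean negative log-likelihood of the expert actions.

    action_probs[t][a] is the policy's probability of action a at step t.
    Steps flagged as sub-goal transitions get weight `transition_weight`
    (a hypothetical hyperparameter); all other steps get weight 1.
    """
    flags = transition_flags(subgoals)
    total, norm = 0.0, 0.0
    for probs, action, is_transition in zip(action_probs, expert_actions, flags):
        weight = transition_weight if is_transition else 1.0
        total += -weight * math.log(probs[action])
        norm += weight
    return total / norm

# Toy trajectory: the sub-goal switches from 0 to 1 after step 1.
subgoals = [0, 0, 1, 1]
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9]]
expert = [0, 1, 1, 1]
print(transition_flags(subgoals))
print(weighted_bc_loss(probs, expert, subgoals))
```

With `transition_weight` above 1, a mistake at the step where the sub-goal changes costs more than the same mistake elsewhere, which is the effect the summary attributes to the transition-augmented low-level policy.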

Abstract

Hierarchical Imitation Learning (HIL) is a promising approach for tackling long-horizon decision-making tasks. However, it remains challenging because detailed supervisory labels for sub-goal learning are lacking and because it relies on hundreds to thousands of expert demonstrations. In this work, we introduce SEAL, a novel framework that leverages the powerful semantic and world knowledge of Large Language Models (LLMs) both to specify the sub-goal space and to pre-label states with semantically meaningful sub-goal representations, without prior knowledge of task hierarchies. SEAL employs a dual-encoder structure, combining supervised LLM-guided sub-goal learning with unsupervised Vector Quantization (VQ) for more robust sub-goal representations. Additionally, SEAL incorporates a transition-augmented low-level planner for improved adaptation to sub-goal transitions. Our experiments demonstrate that SEAL outperforms state-of-the-art HIL methods and LLM-based planning approaches, particularly in settings with small expert datasets and complex long-horizon tasks.
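The unsupervised branch mentioned above relies on Vector Quantization, whose core step is snapping a continuous embedding to its nearest codebook entry. The following is a minimal, generic sketch of that lookup, not SEAL's implementation: the function name `quantize`, the toy codebook, and the two-dimensional embeddings are assumptions made for illustration.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z in squared L2 distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Ties are broken in favour of the lowest index, since min() keeps the first minimum.
    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]

# Hypothetical codebook with K = 3 sub-goal codes in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, code = quantize([0.9, 1.2], codebook)
print(index, code)
```

In a VQ-based sub-goal learner, the returned index plays the role of the discrete sub-goal and the returned vector is the representation passed to the low-level policy. Learning the codebook itself is outside this sketch.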

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of SEAL Architecture: The LLM aids in discovering sub-goal spaces for the task by semantically decomposing the full-task instruction and labeling each state with a reference latent vector that represents its corresponding sub-goal. These reference labels are then used to train a high-level sub-goal encoder, which works in conjunction with an unsupervised VQ encoder.
  • Figure 2: Visualization: Sub-goal selection in an example trajectory of Grid-World with 3 Objects. Each sub-goal is color-coded, and a black circle marks the final step of each trajectory. The ground truth is labeled by a human in this case, and SEAL achieves the best sub-goal transitions.
  • Figure 3: Comparison of success rates for different choices of the number of sub-goals $K$ in the unsupervised HIL baselines LISA and SDIL. Experiments are run on Grid-World with 3 Objects. The $x$-axis shows the different settings of $K$.
  • Figure 4: Examples of compositional-task environments used in our experiments. Left: KeyDoor. The player needs to pick up the key and then use it to unlock the door. Right: Grid-World. The player needs to pick up the different objects in a pre-specified order.
  • Figure 5: A schematic illustrating how LLMs are prompted to define sub-goal spaces from task instructions and map states to sub-goal representations, serving as supervisory labels for training the high-level sub-goal encoder in SEAL.