Language-guided Skill Learning with Temporal Variational Inference

Haotian Fu; Pratyusha Sharma; Elias Stengel-Eskin; George Konidaris; Nicolas Le Roux; Marc-Alexandre Côté; Xingdi Yuan

TL;DR

The paper tackles the sample inefficiency of reinforcement learning on long-horizon tasks by introducing LAST. The framework first prompts a pretrained LLM to split expert trajectories into fine-grained segments and annotate each segment with a language description. It then refines these segments with temporal variational inference, merging short segments into reusable skills. The variational objective is coupled with an MDL-based compression term that trades off how accurately the skills reconstruct the demonstrations against how compact the skill library is. The learned skills feed an online hierarchical RL pipeline in which a high-level controller switches between temporally extended skills. Key contributions include (1) language-guided initial segmentation, (2) a temporal variational inference mechanism that merges short segments into reusable skills, (3) an MDL auxiliary objective that promotes compact skill libraries, and (4) improved zero-shot transfer and faster learning on new long-horizon tasks in BabyAI and ALFRED. The discovered skills are semantically meaningful and shorten the effective horizon of downstream tasks, which matters for sample-efficient RL in complex, multimodal environments. Formally, the method optimizes a variational lower bound together with an MDL-based code-length objective, and runs a SAC-based online RL policy over the learned skill library.
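
The first stage (an LLM proposing segments from the goal and the action sequence alone, as described for Figure 2) can be pictured as a prompt-and-parse routine. This is a minimal sketch under stated assumptions: the prompt wording, the one-segment-per-line "start-end: description" reply format, and the names Segment, build_segmentation_prompt, and parse_segments are hypothetical, and no LLM is actually called; the reply is a hand-written string.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: prompt text, reply format, and names are hypothetical.


@dataclass
class Segment:
    start: int          # index of the first action in the segment (inclusive)
    end: int            # index of the last action in the segment (inclusive)
    description: str    # language annotation proposed by the LLM


def build_segmentation_prompt(goal: str, actions: List[str]) -> str:
    """Assumed prompt: only the goal and the numbered action sequence are shown."""
    lines = [f"Goal: {goal}", "Actions:"]
    lines += [f"{i}: {a}" for i, a in enumerate(actions)]
    lines.append("Split the actions into consecutive sub-tasks. "
                 "Answer one per line as 'start-end: description'.")
    return "\n".join(lines)


def parse_segments(response: str, num_actions: int) -> List[Segment]:
    """Parse assumed 'start-end: description' lines and check they tile the trajectory."""
    segments = []
    for line in response.strip().splitlines():
        span, desc = line.split(":", 1)
        start, end = (int(x) for x in span.split("-"))
        segments.append(Segment(start, end, desc.strip()))
    # Segments must be contiguous, non-overlapping, and cover every action.
    expected = 0
    for seg in segments:
        if seg.start != expected or seg.end < seg.start:
            raise ValueError(f"segment {seg} does not continue from index {expected}")
        expected = seg.end + 1
    if expected != num_actions:
        raise ValueError("segments do not cover the whole trajectory")
    return segments


actions = ["forward", "forward", "pickup", "left", "forward", "toggle"]
prompt = build_segmentation_prompt("open the red door", actions)
# Hand-written stand-in for an LLM reply (no model is queried here).
reply = "0-2: pick up the key\n3-4: go to the door\n5-5: open the door"
for seg in parse_segments(reply, len(actions)):
    print(seg)
```

Validating that the proposed segments tile the trajectory matters because later stages only merge existing segments; they never repair gaps or overlaps.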

Abstract

We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.
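
The Minimum Description Length idea can be illustrated with a generic two-part code: the bits needed to store the skill library plus the bits needed to encode each demonstration given that library. This is a hedged, simplified stand-in, not the paper's objective. It assumes each skill is a categorical distribution over actions, skill indices use a uniform code, every stored probability costs a fixed bits_per_param, and segment lengths are not encoded. The names fit_skills and description_length are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Generic two-part MDL sketch (assumed coding scheme, not the paper's exact objective).
# A segmented trajectory is a list of (skill_id, actions in that segment).
Segmentation = List[Tuple[int, Sequence[str]]]


def fit_skills(data: List[Segmentation]) -> Dict[int, Dict[str, float]]:
    """Maximum-likelihood categorical action distribution for each skill."""
    counts: Dict[int, Counter] = {}
    for traj in data:
        for k, acts in traj:
            counts.setdefault(k, Counter()).update(acts)
    return {k: {a: n / sum(c.values()) for a, n in c.items()} for k, c in counts.items()}


def description_length(data: List[Segmentation], bits_per_param: float = 8.0) -> float:
    """Code length in bits: L(skill library) + L(data | library)."""
    skills = fit_skills(data)
    num_skills = len(skills)
    # L(model): assumed fixed cost for every stored probability in every skill.
    model_bits = bits_per_param * sum(len(dist) for dist in skills.values())
    data_bits = 0.0
    for traj in data:
        for k, acts in traj:
            # Which skill is used (uniform code over the library).
            data_bits += math.log2(num_skills) if num_skills > 1 else 0.0
            # The segment's actions under that skill's distribution.
            data_bits += sum(-math.log2(skills[k][a]) for a in acts)
    return model_bits + data_bits


demo = (["forward", "forward", "pickup"], ["left", "toggle"])
reuse = [[(0, demo[0]), (1, demo[1])] for _ in range(3)]                  # 2 shared skills
no_reuse = [[(2 * i, demo[0]), (2 * i + 1, demo[1])] for i in range(3)]   # 6 one-off skills
single = [[(0, demo[0] + demo[1])] for _ in range(3)]                     # 1 catch-all skill
for name, seg in [("reuse", reuse), ("no_reuse", no_reuse), ("single", single)]:
    print(name, round(description_length(seg), 2))
# reuse 52.26, no_reuse 125.77, single 60.83
```

On this toy data, two reused skills give the shortest code: one-off skills pay for a larger library, while a single catch-all skill pays for poorer action prediction. That is the compression-versus-reusability trade-off the auxiliary objective is meant to control.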

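The online hierarchical RL stage replaces many primitive decisions with a few skill-level ones. The toy sketch below shows only that control flow: a high-level controller picks a skill, the skill's low-level policy acts until its termination condition fires, and control returns to the high level. Everything here is an assumption for illustration. The 1-D corridor, the two hand-coded skills, and run_hierarchical_episode are hypothetical, and the fixed greedy rule stands in for a learned high-level policy (SAC-based, according to the TL;DR).

```python
from typing import Callable, Dict, Tuple

# Toy sketch: a hypothetical skill is a low-level policy plus a termination test.
Skill = Tuple[Callable[[int], int], Callable[[int], bool]]


def run_hierarchical_episode(
    high_level: Callable[[int], int],
    skills: Dict[int, Skill],
    start: int,
    goal: int,
    max_primitive_steps: int = 100,
) -> Tuple[int, int]:
    """Execute skills in a 1-D corridor; return (high-level decisions, primitive steps)."""
    pos, decisions, steps = start, 0, 0
    while pos != goal and steps < max_primitive_steps:
        policy, terminate = skills[high_level(pos)]   # one high-level decision
        decisions += 1
        while True:                                    # run the skill until it terminates
            pos += policy(pos)
            steps += 1
            if pos == goal or terminate(pos) or steps >= max_primitive_steps:
                break
    return decisions, steps


skills: Dict[int, Skill] = {
    0: (lambda p: +1, lambda p: p % 3 == 0),   # "walk right to the next checkpoint"
    1: (lambda p: -1, lambda p: True),         # "step back once"
}


def greedy(pos: int) -> int:
    """Stand-in for a learned high-level policy: always choose skill 0."""
    return 0


print(run_hierarchical_episode(greedy, skills, start=0, goal=9))   # (3, 9)
```

Here the agent needs 9 primitive steps but only 3 high-level decisions, which is the horizon reduction that makes learning on new long-horizon tasks cheaper.
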
Paper Structure

This paper contains 23 sections, 21 equations, 11 figures, and 4 tables.

Introduction
Related Work
Problem Setup
Language-guided Skill Learning with Temporal Variational Inference
Initial Segmentation with LLMs
Skill Discovery with Temporal Variational Inference
Minimum Description Length for Skills
Practical Implementation and Model Architecture
Online Hierarchical RL
Experiments
Experimental Setup
Zero-shot Transfer
Learning on Downstream Tasks
Qualitative Evaluation of the Learned Skills
Discussion
...and 8 more sections

Figures (11)

Figure 1: The trajectory segmentation and merging procedure.
Figure 2: Overall framework of LAST. Step 1: given a dataset of expert demonstrations, we query an LLM (using only the goal and actions as input) for an initial segmentation and a language description of each segment. Step 2: temporal variational inference takes multimodal data as input and improves the segmentation by merging subsequences into skills. Step 3: online hierarchical RL on new tasks uses the learned skills, which greatly shortens the task horizon and helps the agent learn new tasks efficiently.
Figure 3: An overview of the probabilistic graphical model underlying LAST. Each distribution is drawn in the same color as its term in the training objective (Eqn. eqa:obj1). $q(\beta_{:T},k_{:T}\mid \cdot)$ is the approximate posterior, which conditions on all available information. $p(\beta_{:T},k_{:T}\mid \cdot)$ is the true high-level policy, trained to mimic $q(\beta_{:T},k_{:T}\mid \cdot)$. $p(a_t\mid o_{:t}, k_t, G)$ denotes the skill- and goal-conditioned policy.
Figure 4: Comparison of LAST against baseline methods on six downstream ALFRED tasks. We plot average success rate vs. timesteps, with 95% confidence-interval error bars (at least 5 seeds).
Figure 5: Left: LAST's skill segmentation for the task "Put a microwaved potato in the sink". Right: examples of discovered skills and their most commonly used actions. More trajectories, segmentations, and common skills are shown in the appendix figures fig:appqualt and fig:app_pie.
...and 6 more figures
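
The merging step in Figure 1 can be sketched as follows. Given the fine-grained LLM segments and one skill label per segment, adjacent segments that received the same label are fused into a single skill span, and the result is unrolled into per-step boundary indicators β_t and skill labels k_t (the variables in Figure 3). This is an illustration under assumptions: the skill labels are supplied by hand rather than inferred by temporal variational inference, β_t = 1 is taken to mark the first step of a new skill, and merge_segments is a hypothetical name.

```python
from typing import List, Tuple


def merge_segments(
    boundaries: List[Tuple[int, int]], skill_ids: List[int]
) -> Tuple[List[Tuple[int, int, int]], List[int], List[int]]:
    """Merge adjacent segments that share a skill label (hypothetical helper).

    boundaries: inclusive (start, end) action indices of the initial segments.
    skill_ids:  one skill label per segment; assumed given here, whereas LAST
                infers them with temporal variational inference.
    Returns merged (start, end, skill) spans plus per-step beta_t and k_t.
    """
    merged: List[Tuple[int, int, int]] = []
    for (start, end), k in zip(boundaries, skill_ids):
        if merged and merged[-1][2] == k and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, k)   # extend the current skill span
        else:
            merged.append((start, end, k))          # a new skill begins here
    beta, ks = [], []
    for start, end, k in merged:
        for t in range(start, end + 1):
            beta.append(1 if t == start else 0)     # assumed convention: 1 = skill starts
            ks.append(k)
    return merged, beta, ks


segments = [(0, 1), (2, 2), (3, 4), (5, 5)]   # fine-grained initial segmentation
labels = [7, 7, 2, 2]                          # hypothetical skill assignments
spans, beta, k = merge_segments(segments, labels)
print(spans)  # [(0, 2, 7), (3, 5, 2)]
print(beta)   # [1, 0, 0, 1, 0, 0]
print(k)      # [7, 7, 7, 2, 2, 2]
```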

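To make the roles of q and p in Figure 3 concrete, the sketch below evaluates a single-step evidence lower bound for a discrete skill variable: a reconstruction term under the skill-conditioned policy minus a KL divergence from the approximate posterior q to the high-level policy p. It is a deliberate simplification and an assumption: it treats one time step in isolation and drops the boundary variable β_t, so it is the textbook ELBO, not the paper's full objective. The function names and the numbers are hypothetical.

```python
import math
from typing import List, Sequence

# Single-step ELBO sketch for a discrete skill k (simplified; beta_t is ignored).
# Assumes prior[k] > 0 and lik[k] > 0 wherever q[k] > 0.


def step_elbo(q: Sequence[float], prior: Sequence[float], lik: Sequence[float]) -> float:
    """E_q[log p(a|k)] - KL(q || prior) for one observed action."""
    recon = sum(qk * math.log(lk) for qk, lk in zip(q, lik) if qk > 0)
    kl = sum(qk * math.log(qk / pk) for qk, pk in zip(q, prior) if qk > 0)
    return recon - kl


def log_evidence(prior: Sequence[float], lik: Sequence[float]) -> float:
    """Exact log p(a) = log sum_k p(k) p(a|k), which the ELBO lower-bounds."""
    return math.log(sum(pk * lk for pk, lk in zip(prior, lik)))


def exact_posterior(prior: Sequence[float], lik: Sequence[float]) -> List[float]:
    """Bayes rule: the q that makes the bound tight."""
    joint = [pk * lk for pk, lk in zip(prior, lik)]
    z = sum(joint)
    return [j / z for j in joint]


prior = [0.5, 0.3, 0.2]   # high-level policy's distribution over skills k_t
lik = [0.9, 0.1, 0.05]    # p(a_t | o_{:t}, k_t, G) for the observed action
q = [0.6, 0.3, 0.1]       # some approximate posterior over k_t

print(step_elbo(q, prior, lik) <= log_evidence(prior, lik))   # True
gap = step_elbo(exact_posterior(prior, lik), prior, lik) - log_evidence(prior, lik)
print(abs(gap) < 1e-12)                                       # True
```

The gap between the bound and the log evidence is exactly KL(q || true posterior), which is why training p to mimic q, and q to explain the actions, tightens the bound.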