Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

Zhaoxun Ju; Chao Yang; Hongbo Wang; Yu Qiao; Fuchun Sun

TL;DR

This work tackles learning language-conditioned robot policies in multi-task, long-horizon settings without external rewards. It proposes LCSD, a two-stage framework that jointly learns discrete latent skills via a VQ-VAE and a diffusion-based policy, with mutual-information objectives linking the latent skill $z$, the language instruction $l$, and the state $s$. By introducing language reconstruction in the skill decoder and a codebook reinitialization mechanism, LCSD achieves interpretable, diverse skills and improved generalization on BabyAI, LORel, and CALVIN, outperforming prior language-conditioned and skill-based IL methods. The approach advances practical deployment of language-guided imitation by combining principled MI objectives, discrete skill representations, and flexible diffusion policies, while providing extensive ablations and analysis of robustness and interpretability.

Abstract

Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills. However, acquiring inherent primitive skills in a coupled and long-horizon environment without external rewards or human supervision presents significant challenges. In this paper, we evaluate the relationship between skills and language instructions from a mathematical perspective, employing two forms of mutual information within the framework of language-conditioned policy learning. To maximize the mutual information between language and skills in an unsupervised manner, we propose an end-to-end imitation learning approach known as Language Conditioned Skill Discovery (LCSD). Specifically, we utilize vector quantization to learn discrete latent skills and leverage skill sequences of trajectories to reconstruct high-level semantic instructions. Through extensive experiments on language-conditioned robotic navigation and manipulation tasks, encompassing BabyAI, LORel, and CALVIN, we demonstrate the superiority of our method over prior works. Our approach exhibits enhanced generalization capabilities towards unseen tasks, improved skill interpretability, and notably higher rates of task completion success.
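
For orientation, the standard mutual information identity between a skill variable and a language variable can be written in two equivalent ways. The symbols $Z$ (skill) and $L$ (language instruction) are notation assumed here, and whether these two decompositions correspond exactly to the "two forms" used in the paper is an assumption, not something the abstract states.

```latex
\begin{aligned}
I(Z; L) &= H(Z) - H(Z \mid L) \\
        &= H(L) - H(L \mid Z)
\end{aligned}
```

Read this way, the first line favors skills that are diverse overall yet predictable from the instruction, while the second favors instructions that can be recovered from the skills, which is in the spirit of reconstructing instructions from skill sequences.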

Paper Structure (28 sections, 12 equations, 18 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Language Conditioned Policy
Skill Discovery via mutual information
Preliminary
Approach
Problem Formulation
Mutual Information Skill Learning in LCSD
Skill learning
Diffusion policy for Imitation Learning
Experiments
Tasks
Baselines
Results
Ablation Study
...and 13 more sections

Figures (18)

Figure 1: An example of a multi-task, language-conditioned setting. When confronted with intricate language instructions such as "open the drawer and turn the faucet to the right," the agent must decipher and execute the tasks based on the current state.
Figure 2: Overview of LCSD. In the skill learning stage, the encoder maps the current state and language to a lower-dimensional latent space, while the decoder recovers the language embeddings from the quantized latent skills. At each step, a single vector is chosen from the codebook and used to quantize the encoder output. The diffusion model is used as an action predictor conditioned on the current state and skill (or language).
Figure 3: Instruction Semantic Recovery Diagram. The decoder's objective is to choose a distinct skill from each consecutive group within a trajectory and compute the mean squared error (MSE) loss against the frozen CLIP language embedding.
Figure 4: Skill-language mapping in the LORel state environment. Top: skill-language graph for LISA (single encoder); bottom: skill-language graph for our LCSD.
Figure 5: MI training curves in CALVIN and LORel with different skill learning methods. We show the mutual information curves of our method during training in different environments with different skill learning methods.
...and 13 more figures
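
The codebook lookup described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes nearest-neighbor selection under squared L2 distance, which is the usual VQ-VAE choice but is not confirmed by the text above; the function name `quantize`, the array shapes, and the toy values are hypothetical and only serve the illustration.

```python
import numpy as np


def quantize(encoder_out, codebook):
    """Map each encoder output to its nearest codebook vector.

    Assumption: nearest neighbor under squared L2 distance.
    encoder_out: array of shape (batch, dim)
    codebook:    array of shape (num_skills, dim)
    """
    # Pairwise differences between every input and every code: (batch, num_skills, dim)
    diffs = encoder_out[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)      # (batch, num_skills)
    skill_ids = np.argmin(dists, axis=1)     # one discrete skill index per input
    return skill_ids, codebook[skill_ids]    # indices and the quantized vectors


# Toy example: three codes in a 2-D latent space (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
encoder_out = np.array([[0.9, 1.2], [-0.1, 0.1], [-0.8, 1.7]])
skill_ids, quantized = quantize(encoder_out, codebook)
print(skill_ids)  # index of the selected code for each input
```

In a trained model the selected index acts as the discrete skill, and the quantized vector is what downstream components such as the decoder or the policy would consume.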
