Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning
Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun
TL;DR
This work tackles language-conditioned robot learning in multi-task, long-horizon settings without external rewards. It proposes Language Conditioned Skill Discovery (LCSD), a two-stage framework that jointly learns discrete latent skills via a VQ-VAE and a diffusion-based policy, with mutual-information objectives linking the latent skill $z$, the language instruction $l$, and the state $s$. By introducing language reconstruction in the skill decoder and a codebook reinitialization mechanism, LCSD learns interpretable, diverse skills and generalizes better on BabyAI, LORel, and CALVIN, outperforming prior language-conditioned and skill-based imitation learning (IL) methods. The approach combines principled MI objectives, discrete skill representations, and flexible diffusion policies, and is supported by extensive ablations and analyses of robustness and interpretability.
Abstract
Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills. However, acquiring inherent primitive skills in a coupled and long-horizon environment without external rewards or human supervision presents significant challenges. In this paper, we evaluate the relationship between skills and language instructions from a mathematical perspective, employing two forms of mutual information within the framework of language-conditioned policy learning. To maximize the mutual information between language and skills in an unsupervised manner, we propose an end-to-end imitation learning approach known as Language Conditioned Skill Discovery (LCSD). Specifically, we utilize vector quantization to learn discrete latent skills and leverage skill sequences of trajectories to reconstruct high-level semantic instructions. Through extensive experiments on language-conditioned robotic navigation and manipulation tasks, encompassing BabyAI, LORel, and CALVIN, we demonstrate the superiority of our method over prior works. Our approach exhibits enhanced generalization capabilities towards unseen tasks, improved skill interpretability, and notably higher rates of task completion success.
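To make the vector-quantization step concrete, the following is a minimal sketch of mapping continuous latents to discrete skill codes by nearest-neighbor lookup in a codebook. It is not the authors' implementation: the function names, the toy dimensions, and in particular the rule for reinitializing unused codes (replacing them with randomly chosen latents) are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Assign each continuous latent to its nearest codebook entry.

    Returns the discrete code indices and the corresponding code vectors.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in latents
    ]
    return indices, [codebook[k] for k in indices]


def reinit_unused_codes(codebook, latents, indices, rng):
    """Hypothetical reinitialization rule (an assumption, not the paper's):
    replace every code that no latent selected with a randomly chosen latent.
    """
    used = set(indices)
    dead = [k for k in range(len(codebook)) if k not in used]
    new_codebook = [list(c) for c in codebook]  # leave the input untouched
    for k in dead:
        new_codebook[k] = list(rng.choice(latents))
    return new_codebook, dead


# Toy data: 32 four-dimensional latents and a codebook of 4 codes.
rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
codebook = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
codebook.append([100.0] * 4)  # deliberately far away, so it is never selected

indices, quantized = quantize(latents, codebook)
new_codebook, dead = reinit_unused_codes(codebook, latents, indices, rng)
```

In this toy setup the far-away code is never chosen, so it is flagged as unused and moved onto a real latent. A trained model would additionally need a gradient path through the quantization step (for example a straight-through estimator), which this sketch omits.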
