Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data

Hongkuan Zhou; Zhenshan Bing; Xiangtong Yao; Xiaojie Su; Chenguang Yang; Kai Huang; Alois Knoll

TL;DR

This work addresses two limitations of language-conditioned manipulation, data-intensive learning and poor generalization to unseen environments, by introducing Skill Priors in Imitation Learning (SPIL). SPIL converts the action space to a continuous skill space $\mathcal{A}_{\text{skill}} \in \mathbb{R}^{N_h \times 7}$ and learns to compose base skills (translation, rotation, grasping) through an intermediate-level policy, guided by base-skill priors learned with a variational autoencoder trained by optimizing the ELBO $L_{ELBO}$. The model achieves state-of-the-art performance on the CALVIN benchmark, most notably in the zero-shot multi-environment setting, where the average completed task length rises from $0.67$ to $1.71$ and the success rates for completing one to five tasks in a row improve by up to $32.4\%$. It also shows substantial sim2real generalization (SPIL ~33% vs. HULC ~3%). These results indicate that incorporating structured skill priors enables robust language-conditioned manipulation in novel environments and supports more practical real-world deployment of robotic systems.
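
For orientation, a variational autoencoder of this kind is typically trained by maximizing an evidence lower bound of the generic form below. This is the textbook objective, not necessarily the exact $L_{ELBO}$ used in the paper: the encoder symbol $q_{\boldsymbol{\phi}}$, the weight $\beta$, the action-sequence notation $a_{1:N_h}$, and the conditioning of the prior on a base-skill label $y$ are assumptions made for illustration.

$$
L_{ELBO} = \mathbb{E}_{z \sim q_{\boldsymbol{\phi}}(z \mid a_{1:N_h})}\left[\log p_{\boldsymbol{\theta}}(a_{1:N_h} \mid z)\right] - \beta \, D_{KL}\left(q_{\boldsymbol{\phi}}(z \mid a_{1:N_h}) \,\|\, p_{\boldsymbol{\kappa}}(z \mid y)\right)
$$

The first term rewards accurate reconstruction of the action sequence from the skill embedding $z$; the second keeps the encoded skill distribution close to the prior of the base skill it is assigned to.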

Abstract

Language-conditioned robot manipulation aims to develop robots that interpret language commands and manipulate objects accordingly in order to carry out complex tasks. While language-conditioned approaches demonstrate impressive capabilities in familiar environments, they adapt poorly to unfamiliar environment settings. In this study, we propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data to improve generalization to unfamiliar environments. We assess our model's performance in both simulated and real-world environments using a zero-shot setting. In the simulated environment, the proposed approach surpasses previously reported scores on the CALVIN benchmark, especially in the challenging Zero-Shot Multi-Environment setting. The average completed task length, which indicates the average number of tasks the agent can complete consecutively, is more than 2.5 times that of the state-of-the-art method HULC. In addition, we conduct a zero-shot evaluation of our policy in a real-world setting after training exclusively in simulated environments, without additional specific adaptations. In this evaluation, we set up ten tasks, and our approach achieves an average improvement of 30% over the current state-of-the-art approach, demonstrating a high generalization capability in both simulated environments and the real world. For further details, including access to our code and videos, please refer to https://hk-zh.github.io/spil/
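
The average completed task length mentioned above can be illustrated with a minimal Python sketch. It assumes that each evaluation rollout is a chain of five instructions (inferred from the "one to five tasks" success rates in the TL;DR) and that a rollout stops at the first failure. The function names and the rollout data are hypothetical and are not taken from the CALVIN code base.

```python
from typing import List, Sequence

CHAIN_LENGTH = 5  # assumption: five instructions per evaluation chain


def success_rates(completed: Sequence[int], chain_length: int = CHAIN_LENGTH) -> List[float]:
    """Fraction of rollouts that completed at least k tasks in a row, for k = 1..chain_length."""
    if not completed:
        raise ValueError("need at least one rollout")
    n = len(completed)
    return [sum(1 for c in completed if c >= k) / n for k in range(1, chain_length + 1)]


def average_task_length(completed: Sequence[int]) -> float:
    """Mean number of consecutive tasks completed per rollout."""
    if not completed:
        raise ValueError("need at least one rollout")
    return sum(completed) / len(completed)


# Hypothetical data: tasks completed in each rollout before the first failure.
rollouts = [0, 1, 1, 2, 3, 5, 0, 2]
rates = success_rates(rollouts)
avg_len = average_task_length(rollouts)
print("k-in-a-row success rates:", rates)
print("average completed task length:", avg_len)
```

Under these assumptions the average completed task length equals the sum of the one-to-five success rates, which is why the two kinds of numbers move together.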

Paper Structure (11 sections, 4 equations, 5 figures, 2 tables, 2 algorithms)

Introduction
Related works
Methodology
  Overview
  Base Skill Labeling
  Continuous Skill Embeddings with Base Skill Priors
  Imitation Learning with Base Skill Priors
Experiments
  Environment Result
  Real-world Experiments
Conclusion

Figures (5)

Figure 1: Comparison of common approaches (dashed red) and our approach (green). Common approaches usually learn the actions directly from the current observation and instruction. Our approach additionally learns an intermediate-level policy that chooses which base skill to use, based on the current observation and instruction.
Figure 2: The architecture comprises two encoders, the action sequence encoder and the base skill locator, and a decoder that reconstructs skill embeddings into action sequences. The base skill locator takes one-hot embeddings of translation, rotation, and grasping as input and outputs the distribution of the corresponding base skill prior in the skill latent space. The action sequence encoder maps action sequences with a fixed horizon of $N_h$ to a skill distribution in the latent space. The decoder then reconstructs the skill embedding into an action sequence.
Figure 3: t-SNE visualization of skill latent space.
Figure 4: The overall architecture. The encoders map the static observation, gripper observation, and language instruction to embeddings for the plan, language goal, language, static observation, and gripper observation. The skill selector module then decodes a sequence of skill embeddings from the plan, observation, and language goal embeddings. The skill labeler labels the skill embeddings with the base skills: translation, rotation, and grasping. The base skill regularization loss is calculated from the base skill prior distributions (from the base skill locator $f_{\boldsymbol{\kappa}}$), the selected skill instance, and the labeled probability that the instance belongs to each base skill. This labeled probability is also used to compute the categorical regularization loss. Finally, the pre-trained and frozen skill generator $f_{\boldsymbol{\theta}}$ decodes all skill embeddings into action sequences, which are then used to calculate the reconstruction loss (Huber loss).
Figure 5: Real-world experiments. We employ the multi-task language control (MTLC) setting in the CALVIN benchmark, encompassing a total of 10 tasks as listed above. The agent is trained in the simulated CALVIN environment D and directly applied to the real-world setting.
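
The loss terms named in the Figure 4 caption can be sketched in plain Python. This is one plausible reading and not the paper's exact formulation: it assumes diagonal Gaussian distributions in the skill latent space, models the base skill regularization as a label-probability-weighted KL divergence to each base skill prior, and combines it with the Huber reconstruction loss through a hypothetical weight `beta`. The categorical regularization loss is omitted, and all function names and numbers are illustrative.

```python
import math
from typing import Dict, List, Tuple

# (mean, standard deviation) per latent dimension; diagonal Gaussian is an assumption
Gaussian = Tuple[List[float], List[float]]


def huber(pred: List[float], target: List[float], delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for small errors, linear beyond delta."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return total / len(pred)


def kl_diag_gaussians(q: Gaussian, p: Gaussian) -> float:
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dimensions."""
    (mu_q, sd_q), (mu_p, sd_p) = q, p
    return sum(
        math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5
        for m_q, s_q, m_p, s_p in zip(mu_q, sd_q, mu_p, sd_p)
    )


def base_skill_regularizer(skill: Gaussian, priors: Dict[str, Gaussian],
                           label_probs: Dict[str, float]) -> float:
    """KL to each base skill prior, weighted by the labeled probability (assumed form)."""
    return sum(label_probs[name] * kl_diag_gaussians(skill, priors[name]) for name in priors)


# Hypothetical 2-D latent space with one prior per base skill.
priors = {
    "translation": ([0.0, 0.0], [1.0, 1.0]),
    "rotation": ([3.0, 0.0], [1.0, 1.0]),
    "grasping": ([0.0, 3.0], [1.0, 1.0]),
}
skill = ([1.0, 0.0], [1.0, 1.0])  # selected skill distribution (illustrative)
label_probs = {"translation": 0.8, "rotation": 0.2, "grasping": 0.0}

beta = 0.01  # hypothetical weight on the regularizer
recon = huber([0.0, 2.0], [0.5, 0.0])  # decoded vs. demonstrated actions (toy values)
reg = base_skill_regularizer(skill, priors, label_probs)
total_loss = recon + beta * reg
print(recon, reg, total_loss)
```

Weighting by the labeled probabilities means a skill that is labeled mostly as translation is pulled mainly toward the translation prior, while still receiving a smaller pull from the other base skills.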
