Learning Novel Skills from Language-Generated Demonstrations

Ao-Qun Jin, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Yue Cao, Sheng-Bin Duan, Fu-Chao Xie, Zeng-Guang Hou

TL;DR

DemoGen tackles the data and safety bottlenecks of learning novel robotic skills by turning natural language task descriptions into demonstration videos via a vision‑language model and a diffusion‑based video generator. It then extracts state–action pairs with an inverse dynamics model and learns policies through imitation learning, enabling both zero‑shot and few‑shot skill acquisition in simulation. In MetaWorld, generated demonstrations achieve high fidelity and, for novel tasks, drive approximately a threefold improvement in task accomplishment over baselines. This framework reduces data collection costs and safety risks while providing a scalable, language‑driven path toward broad robotic skill acquisition with potential real‑world validation.

Abstract

Robots are increasingly deployed across diverse domains to tackle tasks requiring novel skills. However, current robot learning algorithms for acquiring novel skills often rely on demonstration datasets or environment interactions, resulting in high labor costs and potential safety risks. To address these challenges, this study proposes DemoGen, a skill-learning framework that enables robots to acquire novel skills from natural language instructions. DemoGen leverages a vision-language model and a video diffusion model to generate demonstration videos of novel skills, enabling robots to learn new skills effectively. Experimental evaluations in the MetaWorld simulation environments demonstrate the pipeline's capability to generate high-fidelity and reliable demonstrations. Using the generated demonstrations, various skill learning algorithms achieve three times their original accomplishment rate on novel tasks. These results highlight a novel approach to robot learning, offering a foundation for the intuitive and intelligent acquisition of novel robotic skills. (Project website: https://aoqunjin.github.io/LNSLGD/)
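
The data flow summarized in the TL;DR (task description, expanded prompt, generated video, extracted actions, imitation-learning pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`build_demonstration`, `training_pairs`, and the `toy_*` stand-ins) is hypothetical, frames are reduced to one-dimensional positions instead of images, and the three models are replaced by trivial functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "frame" is a tiny state vector here; in the paper it is a video frame.
Frame = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass
class Demonstration:
    prompt: str            # expanded text description of the task
    frames: List[Frame]    # generated demonstration "video"
    actions: List[Action]  # one action per consecutive frame pair


def build_demonstration(
    task: str,
    expand_prompt: Callable[[str], str],
    generate_video: Callable[[str], List[Frame]],
    inverse_dynamics: Callable[[Frame, Frame], Action],
) -> Demonstration:
    """Chain the three stages: prompt expansion, video generation, action labeling."""
    prompt = expand_prompt(task)
    frames = generate_video(prompt)
    # The inverse dynamics step labels each transition (frame_t, frame_t+1) with an action.
    actions = [inverse_dynamics(cur, nxt) for cur, nxt in zip(frames, frames[1:])]
    return Demonstration(prompt, frames, actions)


def training_pairs(demo: Demonstration) -> List[Tuple[Frame, Action]]:
    """State-action pairs for imitation learning; the last frame has no action."""
    return list(zip(demo.frames, demo.actions))


# Hypothetical stand-ins for the vision-language model, video generator, and IDM.
def toy_expand(task: str) -> str:
    return f"{task}. The robot arm moves steadily toward the target."


def toy_video(prompt: str, steps: int = 5) -> List[Frame]:
    return [(float(t),) for t in range(steps)]


def toy_idm(cur: Frame, nxt: Frame) -> Action:
    # Treat the action as the displacement between consecutive frames.
    return tuple(n - c for c, n in zip(cur, nxt))


demo = build_demonstration("press the button", toy_expand, toy_video, toy_idm)
pairs = training_pairs(demo)
```

Passing the three stages in as callables keeps the sketch agnostic about which models fill each role; a policy trained by imitation learning would consume `pairs` as its supervised dataset.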

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: The novel-skill learning steps of the DemoGen pipeline. For each task, a vision-language model (VLM) generates an extended text description. From this description, a demonstration video generator (DVG) produces demonstration videos. Finally, an inverse dynamics model (IDM) extracts action labels from these videos. Robots can then learn from the generated demonstrations and acquire novel skills.
  • Figure 2: An overview of DemoGen. The task learning process involves four modules: a vision-language model (a), a demonstration video generator (b), an inverse dynamics model (c), and an imitation learning model (d).
  • Figure 3: The proposed framework can generate demonstrations that show fidelity, diversity (a, b) and creativity (c).
  • Figure 4: The cosine similarity matrices between prompt embeddings of different tasks (c). The embeddings of expanded prompts (a) exhibit higher cosine similarity scores than the original prompts (b).
  • Figure 5: Frame sequences of the generated robot actions, which remain consistent with the input trajectories.
  • ...and 1 more figure
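
The cosine similarity matrices mentioned in the Figure 4 caption can be computed as in the sketch below. It assumes each prompt has already been mapped to a fixed-length embedding vector; the function name `cosine_similarity_matrix` and the toy two-dimensional embeddings are illustrative only and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the matrix S with S[i][j] = cos(e_i, e_j) for all embedding pairs."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    if any(n == 0.0 for n in norms):
        raise ValueError("cosine similarity is undefined for a zero vector")
    return [
        [sum(a * b for a, b in zip(ei, ej)) / (ni * nj) for ej, nj in zip(embeddings, norms)]
        for ei, ni in zip(embeddings, norms)
    ]


# Toy embeddings standing in for three task prompts (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = cosine_similarity_matrix(toy_embeddings)
```

The diagonal is 1 up to floating-point error because every vector is compared with itself, and the matrix is symmetric, so only the off-diagonal entries carry information about how similar two task prompts are.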