A Robotic Skill Learning System Built Upon Diffusion Policies and Foundation Models

Nils Ingelhag; Jesper Munkeby; Jonne van Haastregt; Anastasia Varava; Michael C. Welle; Danica Kragic

TL;DR

The paper tackles scalable robotic skill learning in long-tail manipulation tasks by presenting the Robotic Skill Learning System (RSLS), which combines diffusion-based visuomotor policies for skill execution with foundation-model–driven skill selection and precondition validation. New skills are acquired through teleoperated demonstrations (approximately 50–150 per skill) and added to a growing skill library, enabling continuous expansion. The authors evaluate RSLS in both simulated and real-world food-serving scenarios, comparing two leading foundation models (GPT-4 and Gemini) for skill matching and precondition checks, and demonstrate substantial performance gains when integrating LLM and VLM components. The work demonstrates practical impact by enabling robots to learn new tasks with modest demonstration data and by validating the end-to-end framework across multiple environments, with public results and videos available online.

Abstract

In this paper, we build upon two major recent developments in the field, Diffusion Policies for visuomotor manipulation and large pre-trained multimodal foundational models, to obtain a robotic skill learning system. The system acquires new skills from teleoperated demonstrations via behavioral cloning with visuomotor diffusion policies. Foundational models perform skill selection given the user's natural-language prompt. Before a skill is executed, the foundational model performs a precondition check given an observation of the workspace. We compare the performance of different foundational models for this purpose and give a detailed experimental evaluation of the skills taught by the user in simulation and the real world. Finally, we showcase the combined system on a challenging food-serving scenario in the real world. Videos of all experimental executions, as well as the process of teaching new skills in simulation and the real world, are available on the project's website.
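The control flow described in the abstract (select a skill from the prompt, check its precondition against the workspace observation, then execute, or fall back to teaching a new skill) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `Skill`, `SkillLibrary`, `run_request`, and `keyword_selector` are hypothetical, the foundation model and the diffusion policy are replaced by plain Python callables, and the behavior after a failed precondition check or after teaching a new skill is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Skill:
    """A learned skill; `policy` stands in for a trained diffusion policy rollout."""
    name: str
    policy: Callable[[str], str]


class SkillLibrary:
    """Growing collection of skills, keyed by name (hypothetical structure)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def names(self) -> List[str]:
        return list(self._skills)


def keyword_selector(prompt: str, names: List[str]) -> Optional[str]:
    """Toy stand-in for the foundation-model skill selector.

    Returns the first skill whose name words all occur in the prompt,
    or None if no skill matches.
    """
    words = set(prompt.lower().split())
    for name in names:
        if set(name.lower().split()) <= words:
            return name
    return None


def run_request(
    prompt: str,
    observation: str,
    library: SkillLibrary,
    select_skill: Callable[[str, List[str]], Optional[str]],
    check_precondition: Callable[[str, str], bool],
    teach_skill: Callable[[str], Skill],
) -> str:
    """One pass through the assumed select -> check -> execute/teach loop."""
    name = select_skill(prompt, library.names())
    if name is None:
        # No suitable skill: ask the user for demonstrations and train one.
        new_skill = teach_skill(prompt)
        library.add(new_skill)
        return f"learned:{new_skill.name}"
    if not check_precondition(name, observation):
        # Assumption: a failed precondition check simply blocks execution.
        return f"blocked:{name}"
    return library.get(name).policy(observation)
```

In a real system, `select_skill` and `check_precondition` would wrap calls to a foundation model with the prompt and a workspace image, and `teach_skill` would collect teleoperated demonstrations and train a visuomotor diffusion policy; here the observation is just a string so the sketch stays self-contained.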

Paper Structure (15 sections, 7 figures, 3 tables)

Introduction
Background & Related Work
Robotic Skill Learning System
Teaching a Skill
Training/Executing a Skill
Skill Selector
Simulation Setup
Real World Setup
Experimental Evaluation
Human in the Loop
Simulation Setting
Real World Tasks
Skill Selector
Validation of SSLE Framework
Conclusion

Figures (7)

Figure 1: Conceptual overview of our Robotic Skill Learning System. The system receives the user's instructional prompt and an image of the current state. The skill selector module, realized through a foundational model, selects an appropriate skill to perform the task. If no suitable skill is available, the system asks the user to perform a number of demonstrations and train a new skill using visuomotor diffusion policies.
Figure 2: Flowchart overview of our RSLS method. Yellow boxes indicate the Skill Selector realized through a foundational model, green indicates the user's activity, and purple shows the visuomotor diffusion policy.
Figure 3: Setup of the simulation environment, including the VR views for the left and right eye. Note that because the user is free to move through the virtual environment, they can obtain views different from those shown on a 2D screen.
Figure 4: The real-world setting, showing the workspace (blue) and the tool changer (pink), which holds a bottle opener, a serving spoon, and a custom gripper for the sausages.
Figure 5: Demonstration time histograms for the simulation tasks (left): lid removal (blue), box pushing (orange), and item placing (green); and for the real-world tasks (right): bottle opening (blue), lid removal (orange), rice scooping (green), and sausage placing (red). The dashed lines indicate the mean duration of the respective task.
...and 2 more figures
