Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies

Mattes Kraus, Jonas Kuckling

TL;DR

This work presents a framework based on generative adversarial imitation learning for learning collective behaviors from human demonstrations. The learned behaviors are qualitatively meaningful and perform similarly to the provided demonstrations.

Abstract

In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most work on imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and from demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform similarly to the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character, and their performance was comparable to that achieved in simulation.

Paper Structure (16 sections, 5 figures, 1 table)

This paper contains 16 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Two screen captures from our demonstration tool. The user can build an experimental arena and provide demonstrations of swarm behaviors.
  • Figure 2: Information flow in our SwarmGAIL implementation. The discriminator receives swarm-level features as observation. The policy is used in a round-robin fashion to provide actions to all robots based on their local observations.
  • Figure 3: Representations of the demonstrated behavior for all considered missions. In the case of Full Speed and Controlled Speed, the trajectories taken by the robots look visually similar, but the velocity of the robots differs.
  • Figure 4: Violin plots of the return (cumulative reward) of all evaluations across all considered missions. Colors indicate the source of the demonstrations (blue from human-operated, brown from PPO-trained ones). Dark colors represent the initial policies of each imitation learning experiment, light colors the final policy. Black circles correspond to one evaluation of one policy/demonstration. The higher the return, the better the performance.
  • Figure 5: Violin plots of the returns (cumulative reward) of imitated policies in simulation (light colored) and reality (dark colored). Colors indicate the source of the demonstrations (blue from human-operated, brown from PPO-trained ones). Black circles correspond to one evaluation of one policy/demonstration. The higher the return, the better the performance.
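
The information flow described in the caption of Figure 2 can be illustrated with a minimal sketch: one shared policy is queried for each robot in turn using only that robot's local observation, while the discriminator input is computed from the swarm as a whole. Everything named below is a hypothetical placeholder, not the paper's implementation: the observation and action types, the toy policy, and in particular the choice of swarm-level features (centroid and mean distance to the centroid) are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types, chosen for illustration only.
Observation = Tuple[float, ...]  # local observation of a single robot
Action = Tuple[float, float]     # e.g. (left, right) wheel command
Position = Tuple[float, float]   # (x, y) position of a robot in the arena


def round_robin_actions(
    policy: Callable[[Observation], Action],
    local_observations: Sequence[Observation],
) -> List[Action]:
    """Query the same policy once per robot, in order, on local observations."""
    return [policy(obs) for obs in local_observations]


def swarm_features(positions: Sequence[Position]) -> Tuple[float, float, float]:
    """Toy swarm-level features (an assumption, not the paper's feature set):
    centroid x, centroid y, and mean distance of the robots to the centroid."""
    n = len(positions)
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    spread = sum(((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for x, y in positions) / n
    return (cx, cy, spread)


def toy_policy(obs: Observation) -> Action:
    """Placeholder policy: drive both wheels at the first observed value."""
    return (obs[0], obs[0])


if __name__ == "__main__":
    observations = [(0.5,), (1.0,)]
    positions = [(0.0, 0.0), (2.0, 0.0)]
    print(round_robin_actions(toy_policy, observations))
    print(swarm_features(positions))
```

In a GAIL-style setup, a vector like the one returned by `swarm_features` would be what the discriminator scores, so the reward signal reflects the collective behavior rather than any single robot's trajectory.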