Guided Deep Reinforcement Learning for Swarm Systems

Maximilian Hüttenrauch, Adrian Šošić, Gerhard Neumann

TL;DR

The paper tackles learning decentralized policies for swarms with limited sensing under partial observability. It introduces a guided actor-critic framework where a centralized critic has access to the global state during training to learn a joint $Q(s,a)$, while agents act based on local histories via a policy $\mu(h)$. Key contributions include a swarm-MDP formulation, a centralized guided critic, distributed local actors, a histogram-based observation model, and empirical validation on two Kilobot-inspired tasks showing improved learning over non-guided baselines. The work demonstrates the potential to scale deep reinforcement learning to cooperative swarm robotics and points to future directions for incorporating richer state information and sensing.

Abstract

In this paper, we investigate how to learn to control a group of cooperative agents with limited sensing capabilities such as robot swarms. The agents have only very basic sensor capabilities, yet in a group they can accomplish sophisticated tasks, such as distributed assembly or search and rescue tasks. Learning a policy for a group of agents is difficult due to distributed partial observability of the state. Here, we follow a guided approach where a critic has central access to the global state during learning, which simplifies the policy evaluation problem from a reinforcement learning point of view. For example, we can get the positions of all robots of the swarm using a camera image of a scene. This camera image is only available to the critic and not to the control policies of the robots. We follow an actor-critic approach, where the actors base their decisions only on locally sensed information. In contrast, the critic is learned based on the true global state. Our algorithm uses deep reinforcement learning to approximate both the Q-function and the policy. The performance of the algorithm is evaluated on two tasks with simple simulated 2D agents: 1) finding and maintaining a certain distance to each other and 2) locating a target.
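The information asymmetry described in the abstract (actors see only locally sensed information, the critic sees the global state) can be illustrated with a minimal sketch. It is not the authors' implementation. The names `local_observation`, `actor`, and `critic`, the constants `COMM_RADIUS` and `N_BINS`, and the linear and `tanh` function approximators are all assumptions made for illustration; the paper uses deep networks, and its actors condition on local histories rather than on a single observation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 4
COMM_RADIUS = 0.5  # hypothetical communication radius
N_BINS = 5         # hypothetical number of histogram bins


def local_observation(positions, i, radius=COMM_RADIUS, n_bins=N_BINS):
    """Histogram over distances from agent i to neighbours within `radius`."""
    dists = np.linalg.norm(positions - positions[i], axis=1)
    dists = np.delete(dists, i)          # drop the agent's distance to itself
    dists = dists[dists <= radius]       # agents outside the radius are unseen
    hist, _ = np.histogram(dists, bins=n_bins, range=(0.0, radius))
    return hist.astype(float)


def actor(obs, weights):
    """Shared local policy: local observation -> bounded 2D action."""
    return np.tanh(weights @ obs)


def critic(global_state, joint_action, weights):
    """Centralized critic: scores the global state and the joint action.

    A linear stand-in for the learned Q-function; used during training only.
    """
    features = np.concatenate([global_state.ravel(), joint_action.ravel()])
    return float(weights @ features)


# Global state: 2D positions of all agents (e.g. as seen by an overhead camera).
positions = rng.uniform(0.0, 1.0, size=(N_AGENTS, 2))
actor_w = rng.normal(size=(2, N_BINS))
critic_w = rng.normal(size=(4 * N_AGENTS,))

# Each agent acts from its own local observation with the same policy weights.
observations = np.stack([local_observation(positions, i) for i in range(N_AGENTS)])
actions = np.stack([actor(obs, actor_w) for obs in observations])

# Only the critic receives the full positions and the joint action.
q_value = critic(positions, actions, critic_w)
```

In this sketch the policy weights are shared across agents, so the same `actor` can be run on a different number of agents than it was set up with, while `critic` is tied to a fixed swarm size and is only needed while learning.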

Paper Structure

This paper contains 19 sections, 16 equations, 8 figures.

Figures (8)

  • Figure 1: The figure shows a Kilobot robot. This platform is a base for the simulated agents in this paper.
  • Figure 2: Visualization of the simulation environment. The left figure shows the graph task where agents are depicted with green dots and an orientation arrow. The outer red circle shows the communication radius and the inner light green area the distance in which a valid edge is established. The right figure shows the localization task. As long as agents have not found the target they are depicted by a red dot, after finding the target the color is changed to green. The outer red circle again shows the communication radius. The target is depicted by a blue dot.
  • Figure 3: Progression of an episode of 8 agents executing a policy learned by 3 and by 6 agents. In the beginning they are randomly placed in the scene. Over the course of the episode groups of agents are established. These groups move in circular patterns trying to keep their distance to each other.
  • Figure 4: Evaluation of all learned policies of the edge task executed on 2 to 8 agents. Each policy is run 500 times and the plots show the mean return of the learned policies and two times the standard deviation.
  • Figure 5: Mean learning curve for 2 to 8 agents with two times standard deviation for the edge task.
  • ...and 3 more figures