Multi-Agent Reinforcement Learning for Unmanned Aerial Vehicle Coordination by Multi-Critic Policy Gradient Optimization

Yoav Alon, Huiyu Zhou

TL;DR

The proposed Multi-Critic Policy Optimization architecture uses multiple value-estimating networks and a novel advantage function to optimize a stochastic actor policy network. The resulting policy achieves optimal coordination of agents and compliance with constraints such as collision avoidance.

Abstract

Recent technological progress in the development of Unmanned Aerial Vehicles (UAVs), together with decreasing acquisition costs, makes the application of drone fleets attractive for a wide variety of tasks. In agriculture, disaster management, search and rescue operations, and commercial and military applications, the advantage of applying a fleet of drones originates from their ability to cooperate autonomously. Multi-Agent Reinforcement Learning approaches that aim to optimize a neural-network-based control policy, such as the best-performing actor-critic policy gradient algorithms, struggle to effectively back-propagate errors from distinct reward signal sources and tend to favor lucrative signals while neglecting coordination and exploitation of previously learned similarities. We propose a Multi-Critic Policy Optimization architecture with multiple value-estimating networks and a novel advantage function that optimizes a stochastic actor policy network to achieve optimal coordination of agents. We then apply the algorithm to several tasks that require the collaboration of multiple drones in a physics-based reinforcement learning environment. Our approach achieves a stable policy network update and similar reward signal development for an increasing number of agents. The resulting policy achieves optimal coordination and compliance with constraints such as collision avoidance.
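
The idea of keeping one value estimate per reward signal can be illustrated with a minimal sketch. The abstract does not specify the paper's novel advantage function, so the combination rule below (a plain sum of per-critic advantages, each computed as discounted return minus value estimate) is an assumption made for illustration only. The function names `discounted_returns` and `multi_critic_advantage` and the toy reward signals are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return per time step; rewards has shape (T, K) for K signals."""
    returns = np.zeros_like(rewards, dtype=float)
    running = np.zeros(rewards.shape[1])
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def multi_critic_advantage(rewards, values, gamma=0.99):
    """Per-critic advantages A_k = G_k - V_k, plus their sum.

    Summing the per-critic advantages is an ASSUMED combination rule,
    not the advantage function proposed in the paper.
    """
    per_critic = discounted_returns(rewards, gamma) - values
    return per_critic, per_critic.sum(axis=1)

# Toy data: T=3 steps, K=2 reward signals
# (hypothetically: target progress and a collision penalty).
rewards = np.array([[1.0, 0.0], [0.0, -1.0], [1.0, 0.0]])
values = np.zeros((3, 2))  # stand-in for untrained critics that predict zero
per_critic, combined = multi_critic_advantage(rewards, values, gamma=0.5)
print(per_critic)
print(combined)
```

Keeping the signals in separate columns until the final step means a large reward in one column cannot mask the error of another critic, which is the failure mode the abstract attributes to single-critic methods.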

Paper Structure

This paper contains 28 sections, 89 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Coordination of multiple UAVs using Multi-Critic Policy Gradient Optimization (MCPO) to train a single actor policy network that achieves collision avoidance: (a) and (b) Target navigation and rotor balancing for a number of dynamically initialized and terminated drones. (c) Simplified collision model. (d) Collision of drones when using a single-critic architecture (best viewed in color).
  • Figure 2: Illustration of common architectures for multi-agent reinforcement learning. The top image shows a shared policy where all agents share the same actor policy network. Given the same state, they behave identically. The middle image represents another architecture that applies different policies to respective groups of agents. In both cases, actual coordination of agents' actions is usually achieved using various forms of communication as part of the action space. In the third and preferred architecture, coordination is an intrinsic characteristic of the architecture: a single actor policy outputs an action vector that is split into multiple sub-actions, which are fed back to the agents. Its input state is merged from the individual states of all agents. One of the challenges in such an architecture is to dynamically add or remove agents during an episode. In such a case, a multi-value critic network may cause the terminated agents' previously learned value estimation to deteriorate.
  • Figure 3: A single-critic architecture adapted for multiple agents that sums the reward signals and passes them as a scalar to its critic.
  • Figure 4: The proposed Polymorph architecture feeds individual reward signals to corresponding critics, whose architectures need not be identical. The distinct training of value estimation networks leads to a better actor policy update.
  • Figure 5: Hybrid architecture with a single critic extracting multiple values.
  • ...and 9 more figures
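
The third architecture described in the Figure 2 caption (one actor that consumes a merged state and emits a joint action vector split into sub-actions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed number of agent slots, the zero-padding of terminated agents, the boolean mask, and the random linear map standing in for the actor network are all hypothetical choices.

```python
import numpy as np

def merge_states(agent_states, max_agents, state_dim):
    """Stack per-agent states into one flat vector; empty slots stay zero."""
    merged = np.zeros((max_agents, state_dim))
    mask = np.zeros(max_agents, dtype=bool)
    for slot, state in agent_states.items():
        merged[slot] = state
        mask[slot] = True
    return merged.reshape(-1), mask

def split_actions(joint_action, mask, action_dim):
    """Split the joint action vector into sub-actions for active agents only."""
    per_slot = joint_action.reshape(-1, action_dim)
    return {int(slot): per_slot[slot] for slot in np.flatnonzero(mask)}

# Stand-in for the actor: a fixed random linear map (hypothetical, untrained).
rng = np.random.default_rng(0)
MAX_AGENTS, STATE_DIM, ACTION_DIM = 3, 4, 2
weights = rng.normal(size=(MAX_AGENTS * STATE_DIM, MAX_AGENTS * ACTION_DIM))

# Slot 1 is treated as a terminated agent and therefore left out.
states = {0: np.ones(STATE_DIM), 2: np.full(STATE_DIM, 0.5)}
merged, mask = merge_states(states, MAX_AGENTS, STATE_DIM)
sub_actions = split_actions(merged @ weights, mask, ACTION_DIM)
print(sorted(sub_actions))
```

Masking is only one possible way to handle agents that appear or terminate mid-episode; the caption notes that this dynamic case is a challenge for the architecture, and the sketch does not address how critics should treat terminated agents.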