MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning

Tianmeng Hu, Biao Luo, Chunhua Yang, Tingwen Huang

TL;DR

This paper proposes MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem and generates an approximation of the Pareto set.

Abstract

Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks have several conflicting objectives and may require multiple agents to cooperate; these are multi-objective multi-agent decision-making problems. However, only a few works have addressed this intersection. Existing approaches are confined to separate fields and can handle only multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with a parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method on all four evaluation metrics, but also incurs lower computational cost.
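
The abstract refers to "non-dominated solutions" and an approximation of the Pareto set. As a minimal sketch of what that means, the snippet below filters a list of return vectors down to the non-dominated ones, assuming every objective is maximized. The function names and the sample returns are hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective returns, e.g. one per evaluated preference vector.
returns = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (2.5, 0.5)]
front = non_dominated(returns)
print(front)  # [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0)]
```

Here (1.5, 3.0) is dropped because (2.0, 4.0) beats it on both objectives, and (2.5, 0.5) is dropped because (3.0, 1.0) does. The surviving points trade one objective against the other, which is the kind of set the method is said to approximate.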

Paper Structure

This paper contains 24 sections, 23 equations, 10 figures, 2 tables, and 1 algorithm.

Figures (10)

  • Figure 1: The architecture of the CAN. Each agent uses its own CAN to estimate the partial multi-objective Q-function and to select its action. The inputs to a CAN are the partial observation of agent $i$, its action at the previous time step, and the current preference $\boldsymbol{w}$. That is, the preference $\boldsymbol{w}$ serves as a condition for estimating the value function. A GRU layer is used to take advantage of the partial observation history. The CAN outputs one multi-objective Q-vector per available action, each vector containing Q-values for the $m$ objectives. The agents select actions independently based on the $\epsilon$-greedy policy.
  • Figure 2: The architecture of the MOMN. The MOMN takes outputs of all CANs and reorganizes them according to different objectives. The MOMN is internally divided into $m$ parallel tracks corresponding to $m$ objectives. Each track has two MLP layers. Multiple track outputs are then concatenated as the outputs of the entire network, which are the multi-objective joint Q-values. The global state $s_{t}$ at time step $t$ is fed to several hypernetworks and used to generate weights and biases for the mixing network.
  • Figure 3: The connection between hypernetworks and one MOMN track. For each neural network layer of each MOMN track, two hypernetworks are used to generate its parameters: one for the weights and the other for the biases.
  • Figure 4: The average utility curve over 75000 episodes. The data are based on five independent runs of the MO-MIX algorithm, and local means were estimated using a sliding average algorithm with a parameter of 0.85. The light-colored part shows the standard deviation.
  • Figure 5: The hypervolume variation curve. Data are based on five independent runs of the MO-MIX algorithm, tested once every 5000 episodes. The light-colored part shows the standard deviation.
  • ...and 5 more figures
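
The data flow described in the captions of Figures 1 to 3 can be sketched in plain Python. This is a rough illustration under several assumptions that the captions do not state: the greedy action maximizes the weighted sum $\boldsymbol{w}^\top \boldsymbol{Q}$, the Q-vector of the chosen action is what gets passed to the mixing network, mixing weights are made non-negative with `abs` (a QMIX-style convention), and ReLU stands in for the unspecified activation. The fixed parameter tuples stand in for hypernetwork outputs, which in the paper are generated from the global state $s_{t}$. The GRU and the networks themselves are omitted, and all names and numbers are hypothetical.

```python
import random


def scalarize(q_vec, w):
    """Weighted sum of one multi-objective Q-vector under preference w."""
    return sum(q * wi for q, wi in zip(q_vec, w))


def select_action(q_vectors, w, epsilon, rng):
    """Epsilon-greedy over actions; the greedy pick maximizes w . Q (assumed)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_vectors))
    return max(range(len(q_vectors)), key=lambda a: scalarize(q_vectors[a], w))


def regroup_by_objective(chosen_q):
    """Reorganize per-agent Q-vectors (n_agents x m) into m per-objective lists."""
    return [list(column) for column in zip(*chosen_q)]


def mix_track(agent_qs, params):
    """One track with two layers. `params` = (w1, b1, w2, b2) stands in for
    what the hypernetworks would generate from the global state."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)  # ReLU placeholder
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


rng = random.Random(0)
w = [0.5, 0.5]  # preference over m = 2 objectives

# Hypothetical CAN outputs: for each agent, one Q-vector per action (3 actions).
agent_outputs = [
    [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]],
    [[0.2, 0.8], [0.9, 0.3], [0.5, 0.5]],
]

actions = [select_action(q, w, epsilon=0.0, rng=rng) for q in agent_outputs]
chosen = [q[a] for q, a in zip(agent_outputs, actions)]
per_objective = regroup_by_objective(chosen)

# One parameter set per track (one hidden unit each, purely illustrative).
track_params = [
    ([[1.0, 1.0]], [0.0], [1.0], 0.0),
    ([[1.0, 1.0]], [0.0], [1.0], 0.0),
]
joint_q = [mix_track(qs, p) for qs, p in zip(per_objective, track_params)]
print(actions, joint_q)
```

With `epsilon=0.0` the choice is purely greedy, so the agents pick actions 2 and 1 under the equal-weight preference. Each track then mixes only the Q-values of its own objective, which is what keeps the joint output a vector with one entry per objective rather than a single scalar.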