Multi-Agent Reinforcement Learning with a Hierarchy of Reward Machines

Xuejing Zheng, Chao Yu

TL;DR

MAHRM exploits the relationship of high-level events to decompose a task into a hierarchy of simpler subtasks that are assigned to a small group of agents, so as to reduce the overall computational complexity.

Abstract

In this paper, we study cooperative Multi-Agent Reinforcement Learning (MARL) problems in which Reward Machines (RMs) specify the reward functions, so that prior knowledge of high-level events in a task can be leveraged to improve learning efficiency. Existing work has incorporated RMs into MARL for task decomposition and policy learning only in relatively simple domains or under the assumption that the agents are independent. In contrast, we present Multi-Agent Reinforcement Learning with a Hierarchy of RMs (MAHRM), which can handle more complex scenarios where events among agents can occur concurrently and the agents are highly interdependent. MAHRM exploits the relationship of high-level events to decompose a task into a hierarchy of simpler subtasks that are assigned to a small group of agents, so as to reduce the overall computational complexity. Experimental results in three cooperative MARL domains show that MAHRM outperforms other MARL methods using the same prior knowledge of high-level events.

Paper Structure (13 sections, 4 equations, 6 figures, 1 algorithm)

Figures (6)

  • Figure 1: (a) The Pass domain with three agents and four buttons labeled a, b, c and d. (b) The 3-level hierarchical structure of propositions in the Pass domain.
  • Figure 2: (a) For a primitive proposition p(i), its RM is denoted as $\mathcal{R}_{\texttt{p}}[\texttt{i}]$ and is constructed as follows: the states are $U=\{u_0,u_1\}$, and it transitions to the terminal state $u_1$ if and only if the proposition p(i) becomes true; otherwise it remains in $u_0$. (b) The RM of the subtask of proposition ab_c_a assigned to agents i, j, k. At the beginning, if agents i and j press buttons a and b so that agent k reaches the room, then the RM transitions to $u_1$. Then, with button a kept pressed, agent k presses button c to enable agent j to reach the room, and the RM transitions to $u_2$. Finally, agent j presses button d to let agent i reach the room, and the RM terminates at $u_3$. (c) The RM of the joint task consists of the initial state $u_0$ and the terminal state $u_1$, and its transitions are specified by propositions at level 2. The joint task is completed if one of the subtasks of proposition ab_c_a, ab_c_b, ab_d_a or ab_d_b is completed.
  • Figure 3: Experimental results in the Navigation domain for $N=2,3$ and 5 agents.
  • Figure 4: Experimental results in the MineCraft (Left) and the Pass (Right) domains.
  • Figure 5: The RM used in the MineCraft domain contains 7 states, with the initial state $u_0$ and the terminal state $u_6$. Its transitions are specified by the primitive propositions a(i), b(i) and c(i), which mean that the objects a, b and c are collected by agent i, respectively (i=1,2,3). It defines the reward function of the joint task described as follows. First, the three agents have to complete the following two subtasks in any order: "agents 1 and 2 get object a simultaneously", and "agent 3 gets object c". After the objects a and c are collected, the RM transitions to $u_3$. Then the agents have to complete two other subtasks in any order: "agents 2 and 3 get object b simultaneously", and "agent 1 gets object c". Finally, the joint task is completed and the RM transitions to $u_6$.
  • ...and 1 more figure
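
The two-state RM of a primitive proposition described in the caption of Figure 2(a) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `RewardMachine`, the helper `primitive_rm`, and the reward of 1.0 on reaching the terminal state are all hypothetical choices made for this example, since the captions do not specify reward values.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple


@dataclass
class RewardMachine:
    """Illustrative finite-state reward machine (hypothetical helper class)."""
    states: Set[str]
    initial: str
    terminal: Set[str]
    # Maps (state, proposition) -> (next_state, reward).
    delta: Dict[Tuple[str, str], Tuple[str, float]] = field(default_factory=dict)

    def step(self, state: str, true_props: FrozenSet[str]) -> Tuple[str, float]:
        """Advance on the set of propositions that hold at the current step."""
        if state in self.terminal:
            return state, 0.0  # terminal states are absorbing
        for prop in sorted(true_props):
            if (state, prop) in self.delta:
                return self.delta[(state, prop)]
        return state, 0.0  # no matching proposition: remain in the same state


def primitive_rm(prop: str, reward: float = 1.0) -> RewardMachine:
    """Two-state RM: u0 -> u1 if and only if `prop` becomes true.

    The reward value is an assumption made for this sketch.
    """
    return RewardMachine(
        states={"u0", "u1"},
        initial="u0",
        terminal={"u1"},
        delta={("u0", prop): ("u1", reward)},
    )


# Usage: the RM for proposition a(1) ignores unrelated events such as b(2).
rm = primitive_rm("a(1)")
state = rm.initial
trace = []
for labels in [frozenset(), frozenset({"b(2)"}), frozenset({"a(1)"})]:
    state, r = rm.step(state, labels)
    trace.append((state, r))
```

In this sketch the RM stays in `u0` until the labelled event `a(1)` is observed and then moves to the absorbing state `u1`. Multi-state RMs such as the one in Figure 2(b) could be expressed with the same class by adding more entries to `delta`, given a full specification of their transitions.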