
Learning Reward Machines in Cooperative Multi-Agent Tasks

Leo Ardon, Daniel Furelos-Blanco, Alessandra Russo

TL;DR

This work tackles non-Markovian rewards in cooperative multi-agent RL by learning Reward Machines for sub-tasks and integrating them with decentralized Q-learning. Each agent maintains its own RM and Q-function, and synchronization on shared propositions ensures alignment toward the global objective. Experiments on ThreeButtons and Rendezvous demonstrate that learned per-agent RMs can achieve maximal collective rewards and speed up learning, while naive global RM learning can be intractable for larger problems. The approach enhances interpretability of learned policies and offers a scalable path for complex multi-agent environments.

Abstract

This paper presents a novel approach to Multi-Agent Reinforcement Learning (MARL) that combines cooperative task decomposition with the learning of reward machines (RMs) encoding the structure of the sub-tasks. The proposed method helps deal with the non-Markovian nature of the rewards in partially observable environments and improves the interpretability of the learnt policies required to complete the cooperative task. The RMs associated with each sub-task are learnt in a decentralised manner and then used to guide the behaviour of each agent. By doing so, the complexity of a cooperative multi-agent problem is reduced, allowing for more effective learning. The results suggest that our approach is a promising direction for future research in MARL, especially in complex environments with large state spaces and multiple agents.
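
As an illustration of the ideas summarised above, the following is a minimal sketch of a reward machine treated as a finite-state machine over propositions, paired with a tabular Q-learning update over the product of environment state and RM state. It is not the authors' implementation: the class `RewardMachine`, the function `q_update`, the propositions `button` and `goal`, and the hyperparameters are all hypothetical, and the sketch covers neither the learning of RMs from traces nor the synchronisation between agents.

```python
from collections import defaultdict


class RewardMachine:
    """Finite-state machine whose transitions fire on propositions and emit rewards."""

    def __init__(self, initial, terminal, transitions):
        # transitions: {(rm_state, proposition): (next_rm_state, reward)}
        self.initial = initial
        self.terminal = terminal
        self.transitions = transitions

    def step(self, u, labels):
        """Advance from RM state u given the set of propositions observed this step."""
        for p in sorted(labels):
            if (u, p) in self.transitions:
                return self.transitions[(u, p)]
        return u, 0.0  # no matching proposition: stay in place, no reward


def q_update(q, s, u, a, r, s2, u2, actions, terminal, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update over the product state (env state, RM state)."""
    best_next = 0.0 if terminal else max(q[(s2, u2, b)] for b in actions)
    q[(s, u, a)] += alpha * (r + gamma * best_next - q[(s, u, a)])


# Hypothetical two-step sub-task: press a button, then reach a goal.
rm = RewardMachine(
    initial="u0",
    terminal="u_acc",
    transitions={
        ("u0", "button"): ("u1", 0.0),
        ("u1", "goal"): ("u_acc", 1.0),
    },
)

# Reaching the goal before pressing the button earns nothing: the reward
# depends on the history of events, not only on the current grid cell.
u_early, r_early = rm.step(rm.initial, {"goal"})

u, _ = rm.step(rm.initial, {"button"})
u_prev = u
u, r = rm.step(u, {"goal"})

# Assumed per-agent Q-table, indexed by (env state, RM state, action).
q = defaultdict(float)
q_update(q, s=0, u=u_prev, a="right", r=r, s2=1, u2=u,
         actions=["left", "right"], terminal=(u == rm.terminal))
```

Indexing the Q-table by the RM state as well as the environment state is what makes the reward Markovian again from the agent's point of view. In a decentralised setting, each agent would hold its own `rm` and `q` under this sketch.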

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, and 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the ThreeButtons grid (a) and a reward machine modeling the task's structure (b) (Neary, Xu, Wu, and Topcu, 2021).
  • Figure 2: RMs for each of the agents in ThreeButtons (Neary, Xu, Wu, and Topcu, 2021).
  • Figure 3: Comparison between handcrafted RMs (RM Provided) and our approach learning the RMs from traces (RM Learnt) in the ThreeButtons environment.
  • Figure 4: Learnt RM for $A_2$.
  • Figure 5: Example of the Rendezvous task, where two agents must meet at the RDV point (green) before reaching their goal states, $G_1$ for agent $A_1$ and $G_2$ for agent $A_2$.
  • ...and 1 more figure