Learning Reward Machines in Cooperative Multi-Agent Tasks
Leo Ardon, Daniel Furelos-Blanco, Alessandra Russo
TL;DR
This work tackles non-Markovian rewards in cooperative multi-agent reinforcement learning (RL) by learning Reward Machines (RMs) for sub-tasks and integrating them with decentralized Q-learning. Each agent maintains its own RM and Q-function, and synchronization on shared propositions keeps the agents aligned with the global objective. Experiments on ThreeButtons and Rendezvous show that learned per-agent RMs can achieve maximal collective rewards and speed up learning, whereas naively learning a single global RM can become intractable for larger problems. The approach improves the interpretability of learned policies and offers a scalable path for complex multi-agent environments.
Abstract
This paper presents a novel approach to Multi-Agent Reinforcement Learning (MARL) that combines cooperative task decomposition with the learning of reward machines (RMs) encoding the structure of the sub-tasks. The proposed method helps deal with the non-Markovian nature of the rewards in partially observable environments and improves the interpretability of the learnt policies required to complete the cooperative task. The RMs associated with each sub-task are learnt in a decentralised manner and then used to guide the behaviour of each agent. By doing so, the complexity of a cooperative multi-agent problem is reduced, allowing for more effective learning. The results suggest that our approach is a promising direction for future research in MARL, especially in complex environments with large state spaces and multiple agents.
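To make the idea of an RM guiding an agent more concrete, the following is a minimal sketch of how one agent might pair a reward machine with tabular Q-learning. It is an illustration under stated assumptions, not the authors' implementation: the class `RewardMachine`, the function `q_update`, the propositions `button` and `goal`, the grid-cell names, and the choice of giving reward 1 only on reaching the accepting RM state are all hypothetical. Synchronization between agents on shared propositions is not modelled here.

```python
from collections import defaultdict


class RewardMachine:
    """Finite-state machine whose transitions fire on observed propositions."""

    def __init__(self, initial, accepting, transitions):
        self.initial = initial
        self.accepting = accepting
        # Mapping {(rm_state, proposition): next_rm_state}.
        self.transitions = transitions

    def step(self, u, labels):
        """Return (next RM state, reward) after observing a set of propositions."""
        for p in labels:
            if (u, p) in self.transitions:
                u_next = self.transitions[(u, p)]
                # Assumption: reward 1 only when the accepting state is reached.
                reward = 1.0 if u_next == self.accepting else 0.0
                return u_next, reward
        return u, 0.0  # No matching transition: stay in the same RM state.


def q_update(q, s, u, a, r, s_next, u_next, actions, done, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update over the product of env state and RM state."""
    best_next = 0.0 if done else max(q[(s_next, u_next, b)] for b in actions)
    q[(s, u, a)] += alpha * (r + gamma * best_next - q[(s, u, a)])


# Hypothetical sub-task: press a button, then reach the goal.
rm = RewardMachine(
    initial="u0",
    accepting="u_acc",
    transitions={("u0", "button"): "u1", ("u1", "goal"): "u_acc"},
)
q = defaultdict(float)

u = rm.initial
u, r = rm.step(u, {"goal"})       # Goal seen before the button: no progress.
u, r = rm.step(u, {"button"})     # Button pressed: RM moves to u1.
u_next, r = rm.step(u, {"goal"})  # Goal reached after the button: sub-task done.

q_update(q, s="cell_3", u=u, a="right", r=r,
         s_next="cell_4", u_next=u_next,
         actions=["left", "right"], done=True)
```

Because the Q-function is indexed by the pair (environment state, RM state), the reward becomes Markovian with respect to that pair even though it depends on the history of observed propositions. In a decentralised setting, each agent would hold its own `RewardMachine` and its own `q` table.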
