Learning Symbolic Task Decompositions for Multi-Agent Teams

Ameesh Shah, Niklas Lauffer, Thomas Chen, Nikhil Pitta, Sanjit A. Seshia

TL;DR

The paper tackles the credit assignment problem in cooperative multi-agent reinforcement learning by automatically learning how to decompose a complex task into sub-tasks using reward machines. It introduces LOTaD, which simultaneously searches over a set of candidate task decompositions and trains task-conditioned policies for each sub-task, guided by an upper confidence bound strategy to balance exploration and exploitation. The approach relaxes the assumption of independent agent dynamics by providing a global view of the overall task and incentivizing coordination, enabling effective learning even in environments with codependent dynamics. Experimental results across Repairs, Buttons, and Overcooked domains show that LOTaD outperforms baselines, improves sample efficiency, and demonstrates the practicality of automated symbolic task decomposition for multi-agent teams.
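As a rough illustration of the upper-confidence-bound idea mentioned above, the sketch below treats each candidate decomposition as an arm of a bandit and picks one per training episode. It uses the textbook UCB1 rule; the class name, the `exploration` parameter, and the use of episode return as the feedback signal are assumptions made for illustration, not details taken from the paper.

```python
import math


class UCBDecompositionSelector:
    """UCB1 over a fixed set of candidate decompositions (hypothetical helper)."""

    def __init__(self, num_candidates, exploration=1.0):
        self.counts = [0] * num_candidates          # episodes run per candidate
        self.mean_return = [0.0] * num_candidates   # running mean of episode returns
        self.exploration = exploration              # assumed scaling of the bonus term

    def select(self):
        # Try every candidate once before applying the confidence bound.
        for index, n in enumerate(self.counts):
            if n == 0:
                return index
        total = sum(self.counts)
        scores = [
            mean + self.exploration * math.sqrt(2.0 * math.log(total) / n)
            for mean, n in zip(self.mean_return, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, index, episode_return):
        # Incremental mean update for the candidate that was just used.
        self.counts[index] += 1
        n = self.counts[index]
        self.mean_return[index] += (episode_return - self.mean_return[index]) / n


if __name__ == "__main__":
    # Toy stand-in for training: each candidate yields a fixed, made-up return.
    toy_returns = [0.2, 0.8, 0.5]
    selector = UCBDecompositionSelector(num_candidates=len(toy_returns))
    for _ in range(500):
        chosen = selector.select()
        selector.update(chosen, toy_returns[chosen])
    print(selector.counts)
```

In this toy run the selector keeps sampling the weaker candidates occasionally but spends most episodes on the one with the highest return, which is the exploration and exploitation balance the TL;DR refers to.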

Abstract

One approach for improving sample efficiency in cooperative multi-agent learning is to decompose overall tasks into sub-tasks that can be assigned to individual agents. We study this problem in the context of reward machines: symbolic tasks that can be formally decomposed into sub-tasks. In order to handle settings without a priori knowledge of the environment, we introduce a framework that can learn the optimal decomposition from model-free interactions with the environment. Our method uses a task-conditioned architecture to simultaneously learn an optimal decomposition and the corresponding agents' policies for each sub-task. In doing so, we remove the need for a human to manually design the optimal decomposition while maintaining the sample-efficiency benefits of improved credit assignment. We provide experimental results in several deep reinforcement learning settings, demonstrating the efficacy of our approach. Our results indicate that our approach succeeds even in environments with codependent agent dynamics, enabling synchronous multi-agent learning not achievable in previous works.
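For readers unfamiliar with reward machines, the following minimal sketch shows one way to represent a task-completion reward machine as a finite-state machine over high-level events. The event names (`hq`, `station_a`, `station_b`), the state names, and the two hand-written sub-task machines are hypothetical; the sketch does not reproduce the paper's decomposition procedure or check that the sub-tasks form a valid decomposition.

```python
class RewardMachine:
    """Task-completion reward machine: reward 1.0 on first reaching the goal state."""

    def __init__(self, initial, goal, transitions):
        self.initial = initial
        self.goal = goal
        self.transitions = transitions  # maps (state, event) -> next state
        self.state = initial

    def reset(self):
        self.state = self.initial

    def step(self, event):
        # Once the goal is reached the machine stays there and pays nothing more.
        if self.state == self.goal:
            return 0.0
        # Events with no matching transition leave the state unchanged.
        next_state = self.transitions.get((self.state, event), self.state)
        reward = 1.0 if next_state == self.goal else 0.0
        self.state = next_state
        return reward


# Hypothetical team task: visit "hq" first, then both stations in either order.
team_task = RewardMachine(
    initial="u0",
    goal="done",
    transitions={
        ("u0", "hq"): "u1",
        ("u1", "station_a"): "u2a",
        ("u1", "station_b"): "u2b",
        ("u2a", "station_b"): "done",
        ("u2b", "station_a"): "done",
    },
)

# One hand-written candidate split into per-agent sub-tasks (illustrative only).
agent_1_task = RewardMachine("v0", "done", {("v0", "hq"): "v1", ("v1", "station_a"): "done"})
agent_2_task = RewardMachine("w0", "done", {("w0", "station_b"): "done"})

if __name__ == "__main__":
    events = ["station_a", "hq", "station_b", "station_a"]
    print([team_task.step(e) for e in events])
```

Each sub-task machine gives its agent a reward signal tied to its own part of the job, which is the credit-assignment benefit the abstract describes.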

Paper Structure

This paper contains 27 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Visualization of our learning framework. At each new episode of training, a selection method chooses a possible symbolic decomposition of the task and assigns sub-tasks to each agent in the team. As each agent learns the viability of different sub-tasks, our selection method simultaneously finds the optimal task decomposition.
  • Figure 2: (Top) The "Repairs" MDP with a team of 3 agents. (Bottom) A task completion reward machine (RM) encoding the task: agents must navigate the environment to visit the HQ control tower, and then visit a set of communication stations. The goal state of the RM is denoted by concentric circles.
  • Figure 3: A visualization of the information each agent receives under the task-conditioned policy architecture, shown for the Repairs task from Figure 2. In addition to the observation gathered from the MDP, each agent's policy is conditioned on (1) the current state of the original RM task, (2) which decomposition is currently selected, and (3) the current state of their assigned sub-task RM within the selected decomposition.
  • Figure 4: Training curves for LOTaD and baseline methods in our experimental domains. Results are averaged over 5 random seeds.
  • Figure 5: Training curves for LOTaD in the Repairs Task and Cramped-Corridor environments demonstrating the effect of conditioning on the overall task state along with individual sub-task states for each agent.
  • ...and 2 more figures
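
The three conditioning signals listed in the Figure 3 caption can be pictured as extra features appended to each agent's observation. The sketch below assumes one-hot encodings and simple concatenation; the function name, argument names, and encoding choice are illustrative assumptions, and the paper's actual architecture may differ.

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_policy_input(
    observation,          # agent's observation from the MDP, as a list of floats
    team_rm_state,        # (1) index of the current state of the original RM task
    num_team_rm_states,
    decomposition_id,     # (2) index of the currently selected decomposition
    num_decompositions,
    sub_rm_state,         # (3) index of the current state of the agent's sub-task RM
    num_sub_rm_states,
):
    """Concatenate the observation with the three conditioning signals (assumed layout)."""
    return (
        list(observation)
        + one_hot(team_rm_state, num_team_rm_states)
        + one_hot(decomposition_id, num_decompositions)
        + one_hot(sub_rm_state, num_sub_rm_states)
    )


if __name__ == "__main__":
    features = build_policy_input(
        observation=[0.5, -1.0],
        team_rm_state=1, num_team_rm_states=3,
        decomposition_id=0, num_decompositions=2,
        sub_rm_state=2, num_sub_rm_states=3,
    )
    print(features)
```

Because the decomposition index is part of the input, a single policy per agent can in principle be reused across every candidate decomposition rather than training a separate network for each one.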