Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

TL;DR

This work presents ACC-MARL, a framework for learning task-conditioned, decentralized team policies for cooperative, temporal objectives, and shows that the value functions of learned policies can be used to assign tasks optimally at test time.

Abstract

We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
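The test-time assignment step mentioned above can be illustrated with a minimal sketch. It assumes each agent exposes a scalar value estimate for each candidate task and that the team score is the sum of those estimates. Both are illustrative assumptions; the paper's actual aggregation rule is not given in this summary. The names `assign_tasks` and `values` are hypothetical.

```python
from itertools import permutations
from typing import Sequence, Tuple


def assign_tasks(values: Sequence[Sequence[float]]) -> Tuple[Tuple[int, ...], float]:
    """Pick the one-to-one agent-to-task assignment with the highest total value.

    values[i][j] is a hypothetical value estimate of agent i for task j
    (e.g., a learned value function evaluated at the initial state).
    ASSUMPTION: the team score is the sum of per-agent values.
    """
    n = len(values)
    best_perm: Tuple[int, ...] = tuple(range(n))
    best_score = float("-inf")
    # Brute force over all n! assignments; fine for small teams only.
    for perm in permutations(range(n)):
        score = sum(values[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Made-up value estimates for three agents and three tasks.
example_values = [
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.4],
    [0.5, 0.6, 0.7],
]
best_assignment, best_total = assign_tasks(example_values)
# best_assignment[i] is the index of the task given to agent i.
```

Enumerating permutations is exponential in team size; for larger teams, a polynomial-time assignment solver would be the natural replacement under the same additive assumption.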

Paper Structure

This paper contains 25 sections, 2 theorems, 22 equations, 14 figures, 2 tables, 4 algorithms.

Key Result

Theorem 1

Maximizing $J'_{\gamma}(\boldsymbol{\pi}')$ solves prob:dfa-marl as $\gamma \to 1^{-}$, where $J(\boldsymbol{\pi})$ and $J'_{\gamma}(\boldsymbol{\pi}')$ are from eqn:obj_vanilla and eqn:obj1, respectively.

Figures (14)

  • Figure 1: Motivating example Buttons-2 -- details are below.
  • Figure 2: An overview of ACC-MARL, with the Rooms-2 environment shown on the right. During training, we sample tasks from a prior and assign them to agents at random. At each step, tasks are mapped to RAD Embeddings and passed to decentralized policies. Each agent conditions on these embeddings to predict its action. At test time, we use learned value functions to assign tasks optimally.
  • Figure 3: The four-agent variations of TokenEnv considered are shown in (a) and (b); (c) shows a sample RAD DFA.
  • Figure 4: Success probabilities of learned policies throughout training, reported over 5 random seeds -- shaded regions indicate standard deviation. "RAD Embd; PBRS" refers to Markovian policies conditioning on pretrained RAD Embeddings and trained with the shaped reward, i.e., the full solution proposed in \ref{sec:dfa-marl}. We present the results with the history-dependent baseline in \ref{fig:plots_lstm} in the Appendix. We report the results in terms of discounted returns in \ref{fig:disc_return_mean} and \ref{fig:disc_return_mean_lstm} in the Appendix.
  • Figure 5: Policy architecture of an agent $i$.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Definition 1: Markov Game
  • Definition 2: Deterministic Finite Automaton
  • Theorem 1
  • Theorem 1
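
For readers unfamiliar with Definition 2, the following is a minimal sketch of a deterministic finite automaton encoding a temporal task like the button-then-door example from the abstract. The class layout, the symbol names (`"button"`, `"door"`, `"idle"`), and the convention that missing transitions are self-loops are all illustrative assumptions, not the paper's RAD DFA construction.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass
class DFA:
    start: str
    accepting: FrozenSet[str]
    # Maps (state, symbol) to the next state.
    transitions: Dict[Tuple[str, str], str]

    def step(self, state: str, symbol: str) -> str:
        # ASSUMPTION: a symbol with no listed transition leaves the state unchanged.
        return self.transitions.get((state, symbol), state)

    def accepts(self, word: Iterable[str]) -> bool:
        state = self.start
        for symbol in word:
            state = self.step(state, symbol)
        return state in self.accepting


# Hypothetical task: press the button first, then reach the door.
button_then_door = DFA(
    start="q0",
    accepting=frozenset({"q2"}),
    transitions={("q0", "button"): "q1", ("q1", "door"): "q2"},
)

in_order = button_then_door.accepts(["button", "idle", "door"])
out_of_order = button_then_door.accepts(["door", "button"])
```

Here `in_order` is true because the button is observed before the door, while `out_of_order` is false because the door symbol arrives while the automaton is still in its start state.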