Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability

Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, John Vian

TL;DR

The paper tackles multi-task multi-agent reinforcement learning (MT-MARL) under partial observability, where task identities are not observable during execution. It introduces a two-phase approach. Phase I learns a decentralized policy for each single task using Decentralized Hysteretic Deep Recurrent Q-Networks (Dec-HDRQNs) and Concurrent Experience Replay Trajectories (CERTs), which stabilize training under the non-stationarity caused by concurrently learning teammates. Phase II distills these specialized policies into a unified multi-task network that performs well across related tasks without task IDs. The key contributions are Dec-HDRQNs, CERTs, and a distillation framework that yields a task-agnostic policy with strong coordination in sparse-reward Dec-POMDPs. The work demonstrates decentralized coordination and generalization across related tasks, offering a practical methodology for real-world multi-agent systems with partial observability and limited communication.
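The idea behind CERTs can be illustrated with a minimal sketch: experiences are indexed by agent, episode, and timestep, and every agent's minibatch is drawn with the same episode and timestep indices, so the sampled traces stay aligned across teammates. The class and method names below (`Experience`, `ConcurrentReplay`, `add_episode`, `sample`) are hypothetical and are not taken from the paper. Details such as padding short episodes or bounding the buffer size are omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    """One agent's local tuple <o_t, a_t, r_t, o_{t+1}>; the reward is shared by the team."""
    obs: int
    action: int
    reward: float
    next_obs: int


class ConcurrentReplay:
    """Hypothetical buffer laid out as episodes[agent][episode][timestep]."""

    def __init__(self, n_agents):
        self.episodes = [[] for _ in range(n_agents)]

    def add_episode(self, per_agent_trajectories):
        # All agents act in the same episode, so their trajectories have equal length.
        lengths = {len(traj) for traj in per_agent_trajectories}
        assert len(lengths) == 1, "agents must record the same number of timesteps"
        for buf, traj in zip(self.episodes, per_agent_trajectories):
            buf.append(list(traj))

    def sample(self, batch_size, trace_len, rng=random):
        # Draw (episode, start timestep) pairs once, then reuse them for every agent.
        n_episodes = len(self.episodes[0])
        picks = []
        for _ in range(batch_size):
            e = rng.randrange(n_episodes)
            length = len(self.episodes[0][e])
            t0 = rng.randrange(max(1, length - trace_len + 1))
            picks.append((e, t0))
        traces = [
            [buf[e][t0:t0 + trace_len] for e, t0 in picks]
            for buf in self.episodes
        ]
        return traces, picks
```

Here `traces[i][b]` is agent `i`'s trace for batch element `b`. Because the indices are shared, the traces for a given `b` cover the same episode and the same timesteps for every agent.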

Abstract

Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.

Paper Structure

This paper contains 25 sections, 6 equations, 13 figures.

Figures (13)

  • Figure 1: Concurrent training samples for MARL. Each cube signifies an experience tuple $\langle o_t^{(i)}, a_t^{(i)}, r_t, o_{t+1}^{(i)} \rangle$. Axes $e$, $t$, $i$ correspond to episode, timestep, and agent indices, respectively.
  • Figure 2: Task specialization for MAMT domain with $n=2$ agents, $P_f = 0.3$. (a) Without hysteresis, Dec-DRQN policies destabilize in the $5 \times 5$ task and fail to learn in the $6 \times 6$ and $7 \times 7$ tasks. (b) Dec-HDRQN (our approach) performs well in all tasks.
  • Figure 3: The advantage of hysteresis is even more pronounced for MAMT with $n=3$ agents. $P_f = 0.6$ for $3\times3$ task, and $P_f = 0.1$ for $4\times 4$ task. Dec-HDRQN indicated by (H).
  • Figure 4: Dec-HDRQN sensitivity to learning rate $\beta$ ($6 \times 6$ MAMT domain, $n=2$ agents, $P_f=0.25$). Anticipated return $Q(o_0,a_0)$ upper bounds actual return due to hysteretic optimism.
  • Figure 5: MT-MARL performance of the proposed Dec-HDRQN specialization/distillation approach (labeled as Distilled) and simultaneous learning approach (labeled as Multi). Multi-task policies for both approaches were trained on all MAMT tasks from $3 \times 3$ through $6 \times 6$. Performance shown only for $4 \times 4$ and $6 \times 6$ domains for clarity. Distilled approach shows specialization training (Phase I of approach) until 70K epochs, after which distillation is conducted (Phase II of approach). Letting the simultaneous learning approach run for up to 500K episodes did not lead to significant performance improvement. By contrast, the performance of our approach during the distillation phase (which includes task identification) is almost as good as its performance during the specialization phase.
  • ...and 8 more figures
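
The hysteresis referred to in the Figure 2 to 4 captions can be illustrated with a minimal tabular sketch. It assumes the standard hysteretic Q-learning rule, in which a positive TD error is applied with a nominal rate and a negative TD error with a smaller rate $\beta$. The function name `hysteretic_update`, the nominal rate `alpha`, and the default values are assumptions for illustration. The paper's Dec-HDRQN applies the idea to deep recurrent Q-networks, which this sketch does not attempt to reproduce.

```python
def hysteretic_update(q, target, alpha=0.5, beta=0.1):
    """Move the estimate q toward target, more slowly when the TD error is negative.

    alpha: rate for non-negative TD errors (assumed value).
    beta:  smaller rate for negative TD errors (assumed value, beta < alpha).
    """
    delta = target - q
    rate = alpha if delta >= 0 else beta
    return q + rate * delta


def run_alternating_targets(steps=200, alpha=0.5, beta=0.1):
    """Feed targets 1, 0, 1, 0, ... whose mean is 0.5 and return the final estimate."""
    q = 0.0
    for step in range(steps):
        target = 1.0 if step % 2 == 0 else 0.0
        q = hysteretic_update(q, target, alpha, beta)
    return q
```

With `beta < alpha`, the estimate returned by `run_alternating_targets` settles above the 0.5 mean of the targets. This is the kind of optimism that the Figure 4 caption describes, where the anticipated return upper-bounds the actual return. Setting `beta == alpha` recovers an ordinary symmetric update.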

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3