On Stateful Value Factorization in Multi-Agent Reinforcement Learning

Enrico Marchesini, Andrea Baisero, Rupali Bhati, Christopher Amato

TL;DR

This paper formally analyzes the theory of using the state instead of the history in current value factorization methods, reconnecting theory and practice. It also introduces DuelMIX, a factorization algorithm that learns distinct per-agent utility estimators to improve performance and achieve full expressiveness.

Abstract

Value factorization is a popular paradigm for designing scalable multi-agent reinforcement learning algorithms. However, current factorization methods make choices without full justification that may limit their performance. For example, the theory in prior work uses stateless (i.e., history) functions, while the practical implementations use state information -- making the motivating theory a mismatch for the implementation. Also, methods have built off of previous approaches, inheriting their architectures without exploring other, potentially better ones. To address these concerns, we formally analyze the theory of using the state instead of the history in current methods -- reconnecting theory and practice. We then introduce DuelMIX, a factorization algorithm that learns distinct per-agent utility estimators to improve performance and achieve full expressiveness. Experiments on StarCraft II micromanagement and Box Pushing tasks demonstrate the benefits of our intuitions.

Paper Structure (33 sections, 7 theorems, 33 equations, 9 figures, 4 tables)

Key Result

Proposition 3.1

For a joint $Q:\mathcal{H}\times \mathcal{S}\times \mathcal{U} \rightarrow \mathbb{R}$ and individuals $\langle Q_i:H_i\times U_i\rightarrow \mathbb{R}\rangle_{i\in\mathcal{N}}$ such that the following holds: $\langle Q_i(h_i, u_i)\rangle_{i\in\mathcal{N}}$ are said to satisfy History-State IGM for $Q(\bm{h}, s, \bm{u})$.
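The condition itself is not reproduced in the statement above. As a rough illustration only, assume it mirrors the standard IGM requirement, i.e. that for every state $s$ the joint greedy action $\arg\max_{\bm{u}} Q(\bm{h}, s, \bm{u})$ equals the tuple of per-agent greedy actions $\arg\max_{u_i} Q_i(h_i, u_i)$. Under that assumption, and assuming unique maximizers, the property can be checked numerically on small tabular values for one fixed joint history. The function name `satisfies_history_state_igm` and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np


def satisfies_history_state_igm(joint_q, individual_qs):
    """Check the assumed History-State IGM condition for one fixed joint history.

    joint_q: array of shape (num_states, |U_1|, ..., |U_n|), i.e. Q(h, s, u).
    individual_qs: list of 1-D arrays, the i-th being Q_i(h_i, .).
    Assumes every argmax is unique (ties are broken by first index).
    """
    # Per-agent greedy actions, computed without any state information.
    greedy = tuple(int(np.argmax(q)) for q in individual_qs)
    for s in range(joint_q.shape[0]):
        # Joint greedy action for this particular state.
        flat_best = np.argmax(joint_q[s])
        joint_greedy = np.unravel_index(flat_best, joint_q[s].shape)
        if tuple(int(a) for a in joint_greedy) != greedy:
            return False
    return True


# Toy example: two agents, two actions each, three states.
q1 = np.array([1.0, 3.0])
q2 = np.array([2.0, 0.5])
state_bias = np.array([0.0, -5.0, 4.0])

# Additive joint value with a state-dependent bias: the bias shifts values
# but never changes which joint action is greedy.
joint_ok = state_bias[:, None, None] + q1[None, :, None] + q2[None, None, :]

# Perturb one entry so that the joint greedy action changes in state 1 only.
joint_bad = joint_ok.copy()
joint_bad[1, 0, 1] += 100.0

print(satisfies_history_state_igm(joint_ok, [q1, q2]))
print(satisfies_history_state_igm(joint_bad, [q1, q2]))
```

In this sketch the additive joint value passes the check because the state only contributes an action-independent offset, while the perturbed table fails because the best joint action in one state no longer matches the agents' individual greedy choices.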

Figures (9)

  • Figure 1: DuelMIX architecture: (i) agent dueling utility network structure (yellow); (ii) transformation module (green); (iii) mixing network architecture.
  • Figure 2: Learning curves for the stateful algorithms in BP.
  • Figure 3: Saliency maps of the left agent's state value with respect to the initial state, for DuelMIX (left) and QPLEX (right).
  • Figure 4: Average return during training for stateful factorization algorithms in SMACLite maps.
  • Figure 5: State-QMIX architecture (image credit: qmix). (a) Mixing network; (b) QMIX architecture; (c) Individual utility networks. Note the use of state in the mixing network, which means that the output should more correctly be denoted as $Q_{tot}(\bm{\tau}, s, \bm{u})$. This discrepancy between theory and implementation may undermine the validity of the IGM principle for this practical implementation of QMIX.
  • ...and 4 more figures

Theorems & Definitions (8)

  • Proposition 3.1: History-State IGM
  • Proposition 3.2: QMIX State Bias
  • Proposition 3.3: QPLEX State Bias
  • Proposition 4.1
  • Proposition 4.2: DuelMIX State Bias
  • Proposition: QMIX State Bias
  • Proposition
  • proof