On Centralized Critics in Multi-Agent Reinforcement Learning

Xueguang Lyu, Andrea Baisero, Yuchen Xiao, Brett Daley, Christopher Amato

TL;DR

This paper challenges the prevailing view that centralized critics universally improve learning in decentralized multi-agent reinforcement learning. By formalizing histories, states, and their joint distributions in Dec-POMDPs, it proves that history-based centralized critics are unbiased with respect to decentralized history-based gradients, yet do not inherently enhance cooperation and can increase policy-gradient variance. State-based critics, in contrast, can introduce biased gradients or higher variance, depending on the environment, while history-state critics offer a practical trade-off by leveraging state information without inducing bias. Empirically, the authors show that centralized critics often underperform in partially observable settings due to higher variance and cooperation pathologies, though history-state critics frequently provide robust performance across diverse tasks. The work provides practical guidance for critic selection (favoring history-state critics as a safe default) and contributes formal tools for bias-variance analysis in CTDE MARL.
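The unbiasedness and variance claims above can be illustrated on a toy problem. The sketch below is not the paper's experiment or code. It assumes a hypothetical one-step, two-agent game with an invented joint value table `Q_JOINT`, sigmoid-Bernoulli policies, and exact (not learned) critics. The helper names (`policy`, `score`, `q_decentralized`, `gradient_moments`) are illustrative only. It enumerates all joint actions to compute the exact mean and variance of agent 1's single-sample policy-gradient estimate, once with the joint value $Q(a_1, a_2)$ and once with the marginalized value $Q_1(a_1)$.

```python
import math
from itertools import product

# Hypothetical joint action values Q(a1, a2) for a one-step, two-agent game.
# These numbers are made up for illustration; they do not come from the paper.
Q_JOINT = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): 0.0, (1, 1): 4.0}


def policy(theta):
    """Bernoulli policy for one agent: P(a = 1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return {0: 1.0 - p1, 1: p1}


def score(theta, action):
    """d/dtheta of log pi(action) for the sigmoid-Bernoulli policy."""
    p1 = policy(theta)[1]
    return (1.0 - p1) if action == 1 else -p1


def q_decentralized(a1, pi2):
    """Q_1(a1): the joint value averaged over the teammate's policy."""
    return sum(pi2[a2] * Q_JOINT[(a1, a2)] for a2 in (0, 1))


def gradient_moments(theta1, theta2, centralized):
    """Exact mean and variance of agent 1's single-sample gradient estimate."""
    pi1, pi2 = policy(theta1), policy(theta2)
    mean = 0.0
    second_moment = 0.0
    for a1, a2 in product((0, 1), repeat=2):
        prob = pi1[a1] * pi2[a2]
        q = Q_JOINT[(a1, a2)] if centralized else q_decentralized(a1, pi2)
        g = score(theta1, a1) * q
        mean += prob * g
        second_moment += prob * g * g
    return mean, second_moment - mean * mean


mean_c, var_c = gradient_moments(0.0, 0.0, centralized=True)
mean_d, var_d = gradient_moments(0.0, 0.0, centralized=False)
print("centralized critic:   mean", mean_c, "variance", var_c)
print("decentralized critic: mean", mean_d, "variance", var_d)
```

In this toy setting the two estimators share the same expected gradient, while the one using the joint value has the larger variance because it also carries the randomness of the teammate's action. The full Dec-POMDP analysis in the paper involves histories and learned critics, which this sketch deliberately leaves out.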

Abstract

Centralized Training for Decentralized Execution (CTDE), where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic, where the centralized critic is allowed to access global information about the entire system, including the true system state. Such centralized critics are possible given offline information and are not used for online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches, and analyze the effect of using state-based critics in partially observable environments. We derive theories contrary to the common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. We further prove that, in particular, state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments show practical issues such as the difficulty of representation learning with partial observability, which highlights why the theoretical problems are often overlooked in the literature.

Paper Structure

This paper contains 76 sections, 20 theorems, 90 equations, 9 figures, 10 tables, and 6 algorithms.

Key Result

Lemma 1

Value functions $Q^{\bm{\pi}}_i(h_i, a_i)$ and $Q^{\bm{\pi}}({\bm{h}}, {\bm{a}})$ are related by

Figures (9)

  • Figure 1: Empirical returns in the Climb Game, showing that both decentralized and centralized critic methods succumb to the shadowed equilibrium problem (mean and standard deviation over 50 runs per method).
  • Figure 2: Q-value used for updating $\pi(\textit{cereal})$ over time, showing a larger variance in the values for IACC-H than for IAC, which does not diminish in the long term.
  • Figure 3: Performance (mean test return) comparison in the Guess Game (mean and standard deviation over 40 runs per method), showing that a centralized critic cannot bias the actors towards the global optimum even in the simplest situation.
  • Figure 4: Performance of IAC and IACC-H in different domains (mean and standard deviation over 20 runs per method).
  • Figure 5: Performance comparison in SMAC 3m domain and cooperative navigation domains. In these domains, agents navigate to designated target locations for reward, and are penalized for collisions (showing mean and standard deviation over 20 runs per method).
  • ...and 4 more figures

Theorems & Definitions (38)

  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 28 more