
The Benefits of Power Regularization in Cooperative Reinforcement Learning

Michelle Li, Michael Dennis

TL;DR

This work introduces power regularization for cooperative MARL by defining a 1-step adversarial power measure and a linear trade-off objective $U_i(\pi|s)=U_i^{task}(\pi|s)+\lambda U_i^{power}(\pi|s)$ to discourage power concentration. It proves that power-regularizing equilibria exist for any $\lambda$ by mapping to a $p$-adversarial game and applying Nash's theorem, and provides two training algorithms, SBPR and PRIM, to optimize the regularized objective. Through experiments in small games and Overcooked variants, the authors show that both methods can balance task reward and power, reducing vulnerability to off-policy deviations and extreme events, with PRIM often offering superior stability at low regularization levels. The results highlight the practical value of explicitly regulating power to improve robustness and cooperative performance in multi-agent systems.
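To make the trade-off concrete, here is a minimal sketch for a one-shot, two-player cooperative matrix game. It rests on a simplified reading of the 1-step adversarial power measure: the co-player's power over the ego agent is taken to be the drop in the ego agent's reward when the co-player swaps its on-policy action for the worst possible one for a single step. It also assumes $U_i^{power}$ is the negative of that power. The payoff table and all function names are hypothetical and are not taken from the paper.

```python
# Hypothetical payoff table: REWARD[ego_action][co_action] is the shared task reward.
# "shared" pays more on-policy but leaves the ego agent exposed to sabotage.
REWARD = {
    "shared": {"cooperate": 10.0, "sabotage": 0.0},
    "private": {"cooperate": 6.0, "sabotage": 5.0},
}


def adversarial_power(ego_action, co_policy_action):
    """Drop in ego reward if the co-player deviates adversarially for one step.

    Assumed simplification of the 1-step adversarial power measure.
    """
    on_policy = REWARD[ego_action][co_policy_action]
    worst_case = min(REWARD[ego_action].values())
    return on_policy - worst_case


def regularized_utility(ego_action, co_policy_action, lam):
    """Linear trade-off U = U_task + lam * U_power, with U_power = -power (assumed sign)."""
    task = REWARD[ego_action][co_policy_action]
    power_term = -adversarial_power(ego_action, co_policy_action)
    return task + lam * power_term


def best_response(co_policy_action, lam):
    """Ego action that maximizes the regularized utility."""
    return max(REWARD, key=lambda a: regularized_utility(a, co_policy_action, lam))


if __name__ == "__main__":
    for lam in (0.0, 0.25, 1.0):
        print(lam, best_response("cooperate", lam))
```

In this toy game the ego agent prefers the high-reward, high-power action when $\lambda$ is small and switches to the safer action once $\lambda$ exceeds $4/9$, which mirrors the shared-pot versus private-pot choice described in the Overcooked figure captions.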

Abstract

Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power where the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In the context of teams of people, it is often useful to explicitly consider how power is distributed to ensure no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can result in systems that are more robust to single-agent failure, adversarial attacks, and incentive changes of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective which balances task reward and power concentration. Given this new objective, we show that there always exists an equilibrium where every agent is playing a power-regularized best response balancing power and task reward. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, leading to lower-power behavior than the task-only reward baseline and avoiding catastrophic events if an agent in the system goes off-policy.
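The two training ideas named in the abstract can be illustrated with a short sketch. The abstract says only that SBPR injects adversarial data and that PRIM adds an intrinsic motivation term. Everything more specific here is an assumption for illustration: the injection probability `p`, the externally supplied `power_estimate`, and both function names. This is not the authors' implementation.

```python
import random


def sbpr_co_action(on_policy_action, adversarial_action, p, rng):
    """Sample-based sketch: with probability p, replace the co-player's
    on-policy action with an adversarial one when generating training data.

    How p relates to the regularization weight is not specified here (assumption).
    """
    return adversarial_action if rng.random() < p else on_policy_action


def prim_reward(task_reward, power_estimate, lam):
    """Intrinsic-motivation sketch: shape the reward with a penalty
    proportional to an estimate of the power held over the ego agent."""
    return task_reward - lam * power_estimate


if __name__ == "__main__":
    rng = random.Random(0)
    actions = [sbpr_co_action("cooperate", "sabotage", 0.2, rng) for _ in range(10)]
    print(actions)
    print(prim_reward(task_reward=10.0, power_estimate=4.0, lam=0.5))
```

The design difference this highlights is where the regularization enters: the sample-based variant changes the data the ego agent trains on, while the intrinsic-motivation variant leaves the data on-policy and changes the reward signal instead.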

Paper Structure

This paper contains 16 sections, 2 theorems, 7 equations, 5 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Let $G$ be a finite, discrete Markov game. Then a $\lambda$-power regularizing equilibrium exists for any $\lambda$.

Figures (5)

  • Figure 1: Extensive-form diagrams of a simple 1-timestep game and its corresponding $p$-adversarial game at $s$, where Nature chooses an ego agent and whether to use the on-policy or adversarial co-player toward the ego agent. The dashed lines encircle nodes in the same information set for Player 1 because agents act simultaneously. We omit a final layer of Nature nodes that model the probabilistic transition function.
  • Figure 2: Power-regularized objective values achieved by different actions in small environments.
  • Figure 3: Overcooked Close-Pot-Far-Pot. Agents can use the shared middle pot or their private pots. Using the middle pot is faster but incurs high power (see (b)), because one agent can ruin the other's work by adding a wrong ingredient.
  • Figure 4: Experimental Results in Overcooked Close-Pot-Far-Pot. Error bars are standard deviations over 5 trials.
  • Figure 5: Comparison of PRIM, SBPR, and Task-Only Baseline in Overcooked Explosion with $\lambda=0.0001$. In some runs only one agent is visible because the plots coincide completely; the powers incurred are too small to be distinguishable after multiplying by $\lambda$. Error bars are standard deviations over 5 trials except for SBPR interact oracle which only has 3 trials.

Theorems & Definitions (6)

  • Definition 1: 1-step adversarial power
  • Definition 2
  • Definition 3: $\lambda$-Power Regularizing Equilibrium
  • Theorem 1
  • Definition 4: $p$-Adversarial Game of $G$ at $s$
  • Theorem 2