Consolidation via Policy Information Regularization in Deep RL for Multi-Agent Games

Tailia Malloy, Tim Klinger, Miao Liu, Matthew Riemer, Gerald Tesauro, Chris R. Sims

TL;DR

The paper addresses nonstationarity in multi-agent reinforcement learning by introducing a capacity-limited policy information constraint within MADDPG. It formalizes a mutual-information budget $\mathcal{I}(\pi(a|s)) \le \mathcal{C}$ and a reward-regularization weight $\beta$, linking policy complexity to generalization and consolidation. The authors propose MI approximation techniques for deterministic MADDPG policies and present the Capacity-Limited MADDPG algorithm that integrates these terms into the centralized critic framework. Empirical results across cooperative, competitive, and mixed environments show improved generalization and learning stability in most tasks, with certain mixed-task dynamics showing sensitivity to the information budget. This work presents an information-theoretic regularization framework that can mitigate forgetting and improve robustness in nonstationary MARL settings, with practical implications for scalable multi-agent control.
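To make the two quantities in the summary above concrete, the following minimal sketch computes the mutual information $\mathcal{I}$ between states and actions for a small tabular stochastic policy, checks it against a budget $\mathcal{C}$, and applies a $\beta$-weighted penalty to a reward. This is an illustration under stated assumptions, not the paper's method: the paper works with deterministic MADDPG policies and its own MI approximations, whereas this sketch assumes discrete states and actions, and the penalty form `reward - beta * info` is one common way to write such a regularizer. All function names are hypothetical.

```python
import math


def mutual_information(state_probs, policy):
    """Mutual information I(S; A) in nats for a tabular policy.

    state_probs[s] is p(s); policy[s][a] is pi(a | s).
    Assumption: discrete states and actions (illustrative only).
    """
    n_actions = len(policy[0])
    # Marginal action distribution p(a) = sum_s p(s) * pi(a | s).
    marginal = [
        sum(p_s * row[a] for p_s, row in zip(state_probs, policy))
        for a in range(n_actions)
    ]
    info = 0.0
    for p_s, row in zip(state_probs, policy):
        for a, p_a_given_s in enumerate(row):
            # Zero-probability terms contribute nothing to the sum.
            if p_s > 0 and p_a_given_s > 0:
                info += p_s * p_a_given_s * math.log(p_a_given_s / marginal[a])
    return info


def within_budget(info, capacity):
    """Check the capacity constraint I <= C."""
    return info <= capacity


def regularized_reward(reward, info, beta):
    """Assumed penalty form: subtract beta times the policy information."""
    return reward - beta * info


# A policy that picks a different action in each state carries log(2) nats
# (about 0.693); a policy that ignores the state carries none.
state_dependent = mutual_information([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
state_independent = mutual_information([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]])
```

In this sketch a larger `beta` pushes the learner towards policies whose actions depend less on the state, which is the sense in which the summary links policy complexity to generalization.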

Abstract

This paper introduces an information-theoretic constraint on learned policy complexity in the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) reinforcement learning algorithm. Previous research with a related approach in continuous control experiments suggests that this method favors learning policies that are more robust to changing environment dynamics. The multi-agent game setting naturally requires this type of robustness, as other agents' policies change throughout learning, introducing a nonstationary environment. For this reason, recent methods in continual learning are compared to our approach, termed Capacity-Limited MADDPG. Results from experimentation in multi-agent cooperative and competitive tasks demonstrate that the capacity-limited approach is a good candidate for improving learning performance in these environments.

Paper Structure

This paper contains 17 sections, 15 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Multi-agent environments used as a test bed for the capacity-limited version of MADDPG. LEFT: Cooperative Navigation task, in which 3 good agents (light blue) spread out to cover each of the 3 targets (dark grey). Covering only 1 or 2 targets will not maximize the reward. LEFT-CENTER: Cooperative Communication task, in which the speaker (grey) communicates to the listener (light green) which of the three locations (green, red, or blue) is the target for this episode. The listener must learn which location corresponds to the information communicated by the speaker and move to that target location. RIGHT-CENTER: Competitive Keep Away environment, in which the adversary (light red) must push the good agent (light green) away from the target location. The adversary does not know at the beginning of an episode which location (green or blue) is the good agent's target and must infer it from the good agent's behaviour. If the good agent reaches the target location before the adversary, it will be able to remain there. RIGHT: Mixed cooperative and competitive task, in which 2 good agents (light blue) move towards the target location (green) and try to prevent the adversary (light red) from moving there. This can be done by tricking the adversary into moving towards the dummy target (black), as the adversary cannot see which of the locations is the target.
  • Figure 2: MADDPG and CL-MADDPG (labelled as CL-MA) training results in the cooperative communication environment. The green line shows a CL-MA agent with a $\beta$ coefficient of 1e-3. The orange line shows the same scenario with a coefficient of 1e-2. The blue line shows the traditional MADDPG agent. Averages are shown over 5 seeds, with a rolling average window of 5 episodes used to smooth the curve. Error bars represent a 99% confidence interval.
  • Figure 3: MADDPG and CL-MADDPG (labelled as CL-MA) training results in the Cooperative Communication environment. The green line shows a CL-MA agent with a $\beta$ coefficient of 1e-3. The orange line shows the same scenario with a coefficient of 1e-2. The blue line shows the traditional MADDPG agent. Averages are shown over 5 seeds, with a rolling average window of 5 episodes used to smooth the curve. Error bars represent a 99% confidence interval.
  • Figure 4: Good MADDPG vs. adversarial CL-MADDPG, and vice versa, training results in the competitive Push environment. All results report the reward of the 'good' agent. Blue represents a good CL-MADDPG agent with a $\beta$ coefficient of 1e-2 against a traditional MADDPG agent. Orange represents the same scenario with a 1e-3 coefficient. Red represents a good MADDPG agent against a CL-MADDPG agent with a $\beta$ coefficient of 1e-3. Green represents the same scenario with a coefficient of 1e-2. Averages are shown over 5 seeds, with a rolling average window of 5 episodes used to smooth the curve. Error bars represent a 99% confidence interval.
  • Figure 5: Good MADDPG vs. adversarial CL-MADDPG, and vice versa, training results in the physical deception environment. All results report the reward of the 'good' agent. For MADDPG agents, the $\beta$ coefficient of the adversary is listed next to the model name in the legend. Colors are represented in the same manner as in Figure 3. Averages are shown over 5 seeds, with a rolling average window of 5 episodes used to smooth the curve. Error bars represent a 99% confidence interval.