Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana

TL;DR

This work introduces MaxMax Q-Learning (MMQ), which iteratively samples and evaluates potential next states and learns from those with maximal Q-values. In the reported experiments, MMQ frequently outperforms existing baselines, with improved convergence and sample efficiency.

Abstract

In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
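The "sample, evaluate, select the maximum" step described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the function name `maxmax_target`, the tabular `q_table`, the integer state indices, and the way `candidate_next_states` is supplied are all assumptions made for illustration, and this summary does not specify how candidate next states are actually generated.

```python
import numpy as np


def maxmax_target(q_table, reward, candidate_next_states, gamma=0.99):
    """Illustrative double-max bootstrap target (hypothetical helper).

    q_table: array of shape (n_states, n_actions) holding one agent's local Q-values.
    candidate_next_states: integer indices of sampled candidate next states.
    Returns the target value and the candidate state that was selected.
    """
    candidates = np.asarray(candidate_next_states, dtype=int)
    # Inner max: best local action value in each candidate next state.
    best_action_values = q_table[candidates].max(axis=1)
    # Outer max: keep the candidate next state with the highest value.
    best_idx = int(best_action_values.argmax())
    target = reward + gamma * best_action_values[best_idx]
    return float(target), int(candidates[best_idx])


# Toy example with 3 states and 2 actions (values are made up).
q_table = np.array([[0.0, 1.0],
                    [2.0, 0.5],
                    [0.2, 0.3]])
target, chosen = maxmax_target(
    q_table, reward=1.0, candidate_next_states=[0, 2, 1], gamma=0.5
)
```

In this toy setup the selected candidate is the one whose best action value is largest, regardless of how often it appears among the samples, which mirrors the idea of learning from the most promising next state rather than the most frequently observed one.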

Paper Structure

This paper contains 34 sections, 4 theorems, 31 equations, 12 figures, 5 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $\mathcal{S}_{s, a_i}$ be the set of all possible next states as defined in eq:setNextStates and let $\hat{S}$ be a non-empty subset of $\mathcal{S}_{s, a_i}$. Let $s'^{*}$ and $\hat{s}'^*$ represent the best next states in the optimal and approximate regimes, respectively, that is Under Assumptions as:Rlip-as:order2 (see Appendix sec:proofs), if the Euclidean distance $d(s'^{*}, \hat{s}'^*)$

Figures (12)

  • Figure 1: Illustration of the MMQ update for two agents. Different positions of two agents in the rounded rectangles represent different possible next states $s'=(x_{b},x_{y})$. From the perspective of the blue agent: The yellow curve (P) represents the distribution of states with different yellow agent positions ($x_y$) in the replay buffer. The red curve (Q) represents the estimated Q-values for those possible next states. In the MMQ update, the blue agent selects samples for update based on the highest Q-value, marked by $\surd$. Importantly, this selection may not always coincide with the most frequently encountered scenarios (corresponding to the peak of the yellow curve) from past experiences, which may be sub-optimal.
  • Figure 2: Illustration of the set relationship among $\mathcal{S}_{s,a_i}$, $\hat{\mathcal{S}}_{s,a_i}$ and $\hat{\mathcal{S}}^M_{s,a_i}$ (denoted as $M$ above). The red star, $s'^*$, is the best next state in the real set, while $\tilde{s}'^*$ and $\hat{s}'^*$ are the states closest to the best next state in $\hat{\mathcal{S}}_{s,a_i}$ and $\hat{\mathcal{S}}^M_{s,a_i}$, respectively. By the triangle inequality, the distance between $s'^*$ and $\hat{s}'^*$, $d(s'^*, \hat{s}'^*)$, is upper bounded by the sum of $d(s'^*, \tilde{s}'^*)$ and $d(\tilde{s}'^*, \hat{s}'^*)$.
  • Figure 3: Task visualization. (a) Differential Game (DG): agents must cross a wide zero-reward area to reach the center and obtain the optimal reward. (b) Half-Cheetah 2x3: the Half-Cheetah 2x3 scenario in the MAmujoco domain. (c) MPE scenarios. Cooperative Navigation (CN): two agents must enter the grey area of the target together to gain the reward; a solo entry incurs a penalty. CN + More penalty: the same task as CN but with a larger penalty for solo entry. CN + HT: agents can choose to approach one of two targets with different reward settings. CN + HA: the same task as CN, but the two agents have different sizes and velocities. Predator-Prey (PP): two agents must enter the grey area of a pre-trained prey. Sequential Task: two agents must first pass through the grey area of Target B and then enter the grey area of Target A, with the same RO reward design as CN.
  • Figure 4: Performance comparison in the two-agent setting for DG, the MPE scenarios, and Half-Cheetah.
  • Figure 5: (a) Ablation study over the number of samples $M$ in three tasks; (b) percentage of each dimension of the true next states that falls within the predicted quantile bounds, for three tasks.
  • ...and 7 more figures
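
The bound stated in the Figure 2 caption can be written out explicitly. Using the caption's notation, with $d$ the Euclidean distance from Theorem 4.1, the triangle inequality gives:

$$
d(s'^*, \hat{s}'^*) \;\le\; d(s'^*, \tilde{s}'^*) + d(\tilde{s}'^*, \hat{s}'^*)
$$

Read this way, the gap between the true best next state and the one selected from the $M$ samples splits into two parts: how far $s'^*$ is from its closest state $\tilde{s}'^*$ in $\hat{\mathcal{S}}_{s,a_i}$, and how far $\tilde{s}'^*$ is from $\hat{s}'^*$ in $\hat{\mathcal{S}}^M_{s,a_i}$.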

Theorems & Definitions (8)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem B.1
  • proof
  • Lemma B.2
  • proof
  • proof: Proof of Theorem \ref{conv}
  • proof: Proof of Theorem \ref{distance}