A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning

Haozhe Jiang, Qiwen Cui, Zhihan Xiong, Maryam Fazel, Simon S. Du

TL;DR

The paper proposes a versatile black-box approach for non-stationary multi-agent reinforcement learning. Given appropriate learning and testing oracles for stationary environments, it applies to a broad spectrum of problems, such as general-sum games, potential games, and Markov games.

Abstract

We investigate learning equilibria in non-stationary multi-agent systems and address the challenges that differentiate multi-agent learning from single-agent learning. Specifically, we focus on games with bandit feedback, where testing an equilibrium can result in substantial regret even when the gap to be tested is small, and where the existence of multiple optimal solutions (equilibria) in stationary games poses extra challenges. To overcome these obstacles, we propose a versatile black-box approach applicable to a broad spectrum of problems, such as general-sum games, potential games, and Markov games, when equipped with appropriate learning and testing oracles for stationary environments. Our algorithms achieve $\widetilde{O}\left(\Delta^{1/4}T^{3/4}\right)$ regret when the degree of nonstationarity, as measured by the total variation $\Delta$, is known, and $\widetilde{O}\left(\Delta^{1/5}T^{4/5}\right)$ regret when $\Delta$ is unknown, where $T$ is the number of rounds. Moreover, our algorithms inherit the favorable dependence on the number of agents from the oracles. As a side contribution that may be of independent interest, we show how to test for various types of equilibria, including Nash equilibria, correlated equilibria, and coarse correlated equilibria, by a black-box reduction to single-agent learning.
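To get a feel for how the two regret rates in the abstract compare, the following minimal sketch evaluates them numerically. It drops all constants and logarithmic factors hidden by $\widetilde{O}$, so the numbers indicate scaling only; the function names are hypothetical and not taken from the paper.

```python
def regret_rate_known(delta_tv: float, T: float) -> float:
    """Scaling of the bound when the total variation is known: Delta^(1/4) * T^(3/4).

    Constants and log factors are ignored (assumption made for illustration).
    """
    return delta_tv ** 0.25 * T ** 0.75


def regret_rate_unknown(delta_tv: float, T: float) -> float:
    """Scaling of the bound when the total variation is unknown: Delta^(1/5) * T^(4/5)."""
    return delta_tv ** 0.2 * T ** 0.8


def price_of_not_knowing(delta_tv: float, T: float) -> float:
    """Ratio of the two rates, which simplifies algebraically to (T / Delta)^(1/20)."""
    return regret_rate_unknown(delta_tv, T) / regret_rate_known(delta_tv, T)


if __name__ == "__main__":
    T = 1e6
    for delta_tv in (1.0, 1e2, 1e4, 1e6):
        print(
            f"Delta={delta_tv:>9.0f}  known={regret_rate_known(delta_tv, T):12.1f}  "
            f"unknown={regret_rate_unknown(delta_tv, T):12.1f}  "
            f"ratio={price_of_not_knowing(delta_tv, T):.3f}"
        )
```

Both rates are sublinear in $T$ exactly when $\Delta < T$, and their ratio $(T/\Delta)^{1/20}$ grows very slowly, so under these simplifications the cost of not knowing $\Delta$ is a small polynomial factor.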

Paper Structure (26 sections, 25 theorems, 80 equations, 2 figures, 2 tables, 4 algorithms)

Key Result

Proposition 1

With probability $1-T\delta$, the regret of Algorithm alg:warmup satisfies

Figures (2)

  • Figure 1: Consider a two-player cooperative game. Both players have access to the action space $\{a,b\}$, and the corresponding rewards are shown in the picture. Assume we have found the NE $(b,b)$. If we want to make sure $(a,b)$ has not become a best response for player 1, we have to play $(a,b)$ $1/\varepsilon^2$ times. However, the regret of $(a,b)$ is 1, so this process incurs $1/\varepsilon^2$ regret.
  • Figure 2: An example of the scheduling for a committing phase of length $16$ with $Q=2$ and $c=1$. The horizontal lines represent the scheduled $\textsc{Test\_EQ}$ runs, except for the black line at the top, which represents the time horizon. Different colors represent $\textsc{Test\_EQ}$ for different $\epsilon(q)$. The bold parts of a line are the active parts and the remaining parts are the paused parts. The colored vertical lines mark the possible starting points of $\textsc{Test\_EQ}$ at each level. The cross at the last episode indicates that the $\textsc{Test\_EQ}$ is aborted because it spans $2^{c+q}=8$ episodes but has run for only $3<2^q$ of them. The bold part of the black line indicates episodes in which we commit to the learned policy and no $\textsc{Test\_EQ}$ is running.

Theorems & Definitions (56)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Proposition 1
  • Remark 1
  • Corollary 1
  • Example 1
  • Proposition 2
  • ...and 46 more