
Corruption-Robust Offline Two-Player Zero-Sum Markov Games

Andi Nika, Debmalya Mandal, Adish Singla, Goran Radanović

TL;DR

This work proposes robust versions of the Pessimistic Minimax Value Iteration algorithm, both under coverage on the corrupted data and under coverage only on the clean data, and shows that they achieve (near-)optimal suboptimality gap bounds with respect to the corruption level $\epsilon$.

Abstract

We study data corruption robustness in offline two-player zero-sum Markov games. Given a dataset of realized trajectories of two players, an adversary is allowed to modify an $\epsilon$-fraction of it. The learner's goal is to identify an approximate Nash Equilibrium (NE) policy pair from the corrupted data. We consider this problem in linear Markov games under different degrees of data coverage and corruption. We start by providing an information-theoretic lower bound on the suboptimality gap of any learner. Next, we propose robust versions of the Pessimistic Minimax Value Iteration algorithm, both under coverage on the corrupted data and under coverage only on the clean data, and show that they achieve (near-)optimal suboptimality gap bounds with respect to the corruption level $\epsilon$. We are the first to provide such a characterization of the problem of learning approximate Nash Equilibrium policies in offline two-player zero-sum Markov games under data corruption.
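The corruption model described above can be pictured with a short sketch. It assumes the dataset is a Python list of per-step samples (for example `(state, action_1, action_2, reward)` tuples), and `corrupt_dataset` is a hypothetical helper, not part of the paper. For illustration only, the adversary here picks which samples to modify uniformly at random; the adversary in the paper may instead choose both the indices and the replacements with full knowledge of the data.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def corrupt_dataset(
    clean: Sequence[T],
    epsilon: float,
    replace: Callable[[T], T],
    seed: int = 0,
) -> List[T]:
    """Return a copy of `clean` in which floor(epsilon * N) samples are modified.

    Hypothetical sketch of an epsilon-corruption model: `replace` plays the
    role of the adversary's modification; the choice of indices is random
    here purely for illustration.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    n = len(clean)
    budget = int(epsilon * n)  # number of samples the adversary may touch
    rng = random.Random(seed)
    targets = set(rng.sample(range(n), budget))  # indices to corrupt
    return [replace(x) if i in targets else x for i, x in enumerate(clean)]


# Usage: flip the reward sign on a 25% fraction of 20 identical samples.
clean = [(0, 0, 0, 1.0)] * 20
corrupted = corrupt_dataset(clean, 0.25, lambda t: (t[0], t[1], t[2], -1.0), seed=1)
```

With `epsilon = 0.25` and 20 samples, exactly 5 samples carry the flipped reward; the learner only ever sees `corrupted`.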

Paper Structure (26 sections, 26 theorems, 149 equations, 1 figure, 1 table, 1 algorithm)

Key Result

Theorem 1

For every algorithm $L$, there exist a Markov game $\mathcal{G}$, an instance of the corrupted dataset, a corruption level $\epsilon$, and a data-collecting distribution $\rho$ such that, with probability at least $1/4$, the policy pair output by $L$ is no better than an $\Omega(Hd\epsilon)$-approximate NE policy pair, i.e., its suboptimality gap is $\Omega(Hd\epsilon)$.
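To make the quantity in Theorem 1 concrete, the sketch below computes a suboptimality gap in the simplest case: a single state and a single step ($H = 1$), i.e., a zero-sum matrix game in which the row player maximizes. It assumes the gap is the standard duality (Nash) gap, $\max_{x'} x'^\top A y - \min_{y'} x^\top A y'$; the paper's exact definition for multi-step linear Markov games may differ, and `duality_gap` is a hypothetical helper. A pair $(x, y)$ is a $\delta$-approximate NE when this gap is at most $\delta$; Theorem 1 says that, on some instance, any learner's output has a gap of order $Hd\epsilon$.

```python
from typing import Sequence


def duality_gap(
    A: Sequence[Sequence[float]],
    x: Sequence[float],
    y: Sequence[float],
) -> float:
    """Duality gap of mixed strategies (x, y) in the zero-sum matrix game A.

    Row player (strategy x) maximizes x^T A y; column player (strategy y)
    minimizes it. Best responses to a fixed mixed strategy are attained at
    pure actions, so it suffices to scan rows and columns.
    """
    rows, cols = len(A), len(A[0])
    # Value of each pure row action against y: the max-player's best response.
    row_vals = [sum(A[i][j] * y[j] for j in range(cols)) for i in range(rows)]
    # Value of x against each pure column action: the min-player's best response.
    col_vals = [sum(x[i] * A[i][j] for i in range(rows)) for j in range(cols)]
    return max(row_vals) - min(col_vals)


# Matching pennies: the uniform pair is an exact NE (gap 0), while a
# deterministic row strategy can be exploited (gap 1).
pennies = [[1.0, -1.0], [-1.0, 1.0]]
exact = duality_gap(pennies, [0.5, 0.5], [0.5, 0.5])
exploitable = duality_gap(pennies, [1.0, 0.0], [0.5, 0.5])
```

The gap is always non-negative, and it is zero exactly at a Nash Equilibrium of the matrix game.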

Figures (1)

  • Figure 1: Relationship between coverage assumptions. The minimal coverage requirements are single policy coverage and LRU coverage for the single-player and two-player settings, respectively. Arrows stand for implications. The middle (dashed) arrow denotes the restriction from the two-player to the single-player setting: when fixing the second player's policy, the LRU coverage assumption reduces to the uniform coverage assumption.

Theorems & Definitions (51)

  • Definition 1
  • Definition 2: Linear Markov games
  • Definition 3: Compliance of dataset
  • Definition 4
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Remark 2
  • Lemma 1: chen2022online
  • Theorem 3
  • ...and 41 more