Learning in Markov Games with Adaptive Adversaries: Policy Regret, Fundamental Barriers, and Efficient Algorithms

Thanh Nguyen-Tang, Raman Arora

TL;DR

This work studies learning in two-player Markov games against adaptive opponents, using policy regret $PR(T)$ as a counterfactual performance benchmark. It proves negative results for unrestricted adaptive adversaries (unbounded memory or non-stationary) and lower bounds showing that hardness persists even for memory-bounded, stationary adversaries, then introduces a consistency assumption on adversaries to enable learnability. Under this assumption, two algorithms are developed: OPO-OMLE for memory length $m=1$ and APE-OVE for general $m$. Both achieve sublinear $PR(T)$, with $\sqrt{T}$ rates against memory-bounded, stationary, and consistent adversaries. The paper provides detailed regret guarantees with instance-dependent terms (notably the minimum positive visitation probability $d^*$) and discusses open gaps and future directions, such as computational considerations and function approximation for scalable Markov game learning against adaptive opponents.

Abstract

We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policies, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary adversaries, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
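
To make the gap between external regret and policy regret concrete, the following is a minimal sketch in a toy setting that is not taken from the paper: a single-state, one-step game (a degenerate Markov game) against a hypothetical 1-memory, stationary adversary that copies the learner's previous action. The payoff table `REWARD`, the adversary rule `adversary_response`, and all function names are illustrative assumptions. External regret holds the adversary's realized actions fixed, while policy regret replays the adversary against each fixed alternative.

```python
from typing import List, Optional, Tuple

# Hypothetical learner payoffs: REWARD[a][b] for learner action a and
# adversary action b (0 = cooperate, 1 = defect). Illustrative values only.
REWARD = [[3, 0],
          [5, 1]]


def adversary_response(prev_learner_action: Optional[int]) -> int:
    """Assumed 1-memory stationary adversary: cooperate in the first round,
    then copy the learner's previous action."""
    return 0 if prev_learner_action is None else prev_learner_action


def play(learner_actions: List[int]) -> Tuple[int, List[int]]:
    """Roll out the interaction; return the learner's total reward and the
    adversary's realized action sequence."""
    total, prev, adv_actions = 0, None, []
    for a in learner_actions:
        b = adversary_response(prev)
        adv_actions.append(b)
        total += REWARD[a][b]
        prev = a
    return total, adv_actions


def external_regret(learner_actions: List[int]) -> int:
    """Best fixed action against the adversary's *realized* actions."""
    total, adv = play(learner_actions)
    best_fixed = max(sum(REWARD[a][b] for b in adv) for a in (0, 1))
    return best_fixed - total


def policy_regret(learner_actions: List[int]) -> int:
    """Best fixed action when the adversary *re-reacts* to that action."""
    total, _ = play(learner_actions)
    horizon = len(learner_actions)
    best_counterfactual = max(play([a] * horizon)[0] for a in (0, 1))
    return best_counterfactual - total


T = 100
always_defect = [1] * T
print(external_regret(always_defect))  # 0
print(policy_regret(always_defect))    # 196
```

In this toy instance, always defecting has zero external regret because it is the best reply to the realized adversary sequence, yet its policy regret grows linearly in `T` because always cooperating would have kept the adversary cooperative. This is only an illustration of why external regret can be uninformative against adaptive opponents; the paper's setting compares against fixed policies in a full Markov game.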

Paper Structure

This paper contains 35 sections, 20 theorems, 64 equations, 1 table, and 4 algorithms.

Key Result

Theorem 1

For any learner, there exists an adaptive adversary and a Markov game instance such that ${\textrm{PR}}(T) = \Omega(T)$.

Theorems & Definitions (47)

  • Example 3.1: Nash equilibrium
  • Theorem 1
  • Definition 1: $m$-memory bounded adversaries
  • Theorem 2
  • Definition 2: Stationary adversaries
  • Theorem 3
  • Definition 3: Consistent adversaries
  • Remark 1: $\zeta$-approximately consistent adversaries
  • Remark 2
  • Remark 3
  • ...and 37 more