Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff

TL;DR

This paper tackles online learning in two-player uninformed Markov games, where the opponent's actions are hidden. It introduces empirical Nash-value regret (ENR), a metric stronger than Nash-value regret (NR) that reduces to external regret when the opponent is fixed. It analyzes epoch-based V-learning to obtain an ENR bound of $\tilde{O}(\eta C + \sqrt{K/\eta})$, where $\eta$ is the epoch incremental factor, and then designs a parameter-free meta-algorithm that adaptively restarts epoch-based V-learning to achieve an ENR bound of $\tilde{O}(\min\{ \sqrt{K}+(CK)^{1/3}, \sqrt{LK} \})$, where $K$ is the number of episodes, $C$ captures the opponent's non-stationarity, and $L$ counts the opponent's policy switches. The bound recovers the known $\tilde{O}(\sqrt{K})$ external regret in stationary settings and matches the worst-case $\tilde{O}(K^{2/3})$ rate only when the opponent is maximally non-stationary, adapting automatically to the level of non-stationarity in between. The results provide a principled interpolation between regimes and open pathways for tightening the bounds or extending them to broader multi-agent online learning scenarios.
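To get a feel for how the stated rate interpolates between regimes, the following minimal sketch evaluates only the shape of the bound $\min\{\sqrt{K}+(CK)^{1/3}, \sqrt{LK}\}$. The function name `enr_bound_shape` is hypothetical and not from the paper. The sketch drops constants, logarithmic factors, and any dependence on other problem parameters, and it assumes $L \geq 1$ so that the switching term does not vanish (the paper's exact convention for $L$ may differ).

```python
import math


def enr_bound_shape(K: int, C: float, L: int) -> float:
    """Shape of min{sqrt(K) + (C*K)^(1/3), sqrt(L*K)}.

    Illustrative only: constants, log factors, and the dependence on
    other problem parameters are dropped. Assumes L >= 1 (an assumption
    of this sketch, not necessarily the paper's convention).
    """
    if K <= 0 or C < 0 or L < 1:
        raise ValueError("expected K > 0, C >= 0, L >= 1")
    variation_term = math.sqrt(K) + (C * K) ** (1.0 / 3.0)
    switching_term = math.sqrt(L * K)
    return min(variation_term, switching_term)


if __name__ == "__main__":
    K = 10**6
    # Fixed opponent (C = 0): the sqrt(K) rate.
    print(enr_bound_shape(K, C=0, L=1))
    # Few policy switches but large C: the sqrt(L*K) term is smaller.
    print(enr_bound_shape(K, C=K, L=4))
    # Maximal non-stationarity (C and L of order K): about K^(2/3).
    print(enr_bound_shape(K, C=K, L=K))
```

Taking the minimum of the two terms mirrors the statement of the bound: whichever measure of non-stationarity is more favorable for a given opponent determines the rate.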

Abstract

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well known to be achievable, their result is still the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
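As a rough consistency check on how the two bounds in the abstract relate, one can balance the two terms of $O(\eta C + \sqrt{K/\eta})$. This is a heuristic sketch, not the paper's argument: it suppresses constants, logarithmic factors, and the dependence on $H$, $|S|$, and $|A|$, and it pretends $C$ is known when choosing $\eta$ (the symbol $\eta^\star$ is introduced here only for illustration), whereas the paper's algorithm is parameter-free.

$$
\eta C = \sqrt{K/\eta}
\;\Longleftrightarrow\;
\eta^{3/2} = \frac{\sqrt{K}}{C}
\;\Longleftrightarrow\;
\eta^\star = K^{1/3} C^{-2/3},
\qquad
\eta^\star C + \sqrt{K/\eta^\star} = 2\,(CK)^{1/3}.
$$

When $C \lesssim \sqrt{K}$, the balancing choice $\eta^\star$ is at least of constant order, so capping $\eta$ at a constant gives $\eta C + \sqrt{K/\eta} = O(C + \sqrt{K}) = O(\sqrt{K})$. Combining the two cases yields the shape $\sqrt{K} + (CK)^{1/3}$, and plugging in the largest value $C = O(K)$ gives $(K \cdot K)^{1/3} = K^{2/3}$, consistent with the worst-case rate quoted above.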

Paper Structure

This paper contains 32 sections, 25 theorems, 107 equations, and 2 algorithms.

Key Result

Theorem 3

Suppose that the opponent is oblivious and $K \geq H|S|$. If we run alg:epoch_V_ol with $\eta \in \left[|S|/K, 1/H\right]$ and the adversarial bandit subroutine is instantiated with ex:adv_subroutine (so $\iota=\Theta(H^2|A|\log(HK|A||S|/\delta))$), then with probability at least $1-\delta$, the est

Theorems & Definitions (49)

  • Example 2
  • Theorem 3
  • Corollary 4: External regret under a stationary opponent
  • Lemma 5
  • Lemma 6
  • Theorem 7
  • Remark 8
  • Lemma 9: Restatement of lem:bound_num_epoch
  • proof
  • Definition 10
  • ...and 39 more