Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Junyan Liu; Haipeng Luo; Zihan Zhang; Lillian J. Ratliff

TL;DR

This paper tackles online learning in two-player uninformed Markov games, where the opponent's actions are hidden. It introduces empirical Nash-value regret (Enr), a metric stronger than Nash-value regret (Nr) that reduces to external regret under a fixed opponent. It analyzes epoch-V-learning to obtain a bound of $\tilde{O}(\eta C + \sqrt{K/\eta})$ for Enr, where $\eta$ is the epoch incremental factor, and then designs a parameter-free meta-algorithm that adaptively restarts epoch-V-learning to achieve $\tilde{O}(\min\{ \sqrt{K}+(CK)^{1/3}, \sqrt{LK} \})$ for Enr, where $C$ captures non-stationarity and $L$ counts opponent policy switches. This approach recovers the known $\tilde{O}(\sqrt{K})$ external regret in stationary settings and incurs the $\tilde{O}(K^{2/3})$ rate only in the worst case, because it adapts automatically to the level of non-stationarity. The results interpolate between the two regimes and leave room for tighter bounds and for extensions to broader multi-agent online learning settings.
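As a rough illustration of how the stated bound interpolates between the two regimes, the sketch below evaluates $\min\{\sqrt{K}+(CK)^{1/3}, \sqrt{LK}\}$ at the order level. It ignores constants, logarithmic factors, and any dependence on $H$, $|S|$, and $|A|$. The function name `enr_bound_order` is hypothetical, and requiring $L \geq 1$ is an assumption made here so that the $\sqrt{LK}$ branch stays meaningful; the paper's exact convention for $L$ is not given on this page.

```python
import math


def enr_bound_order(K: int, C: float, L: int) -> float:
    """Order-level value of min{sqrt(K) + (C*K)^(1/3), sqrt(L*K)}.

    Constants and log factors are dropped. L >= 1 is an assumption of
    this sketch, not a convention confirmed by the source.
    """
    if K <= 0 or C < 0 or L < 1:
        raise ValueError("expected K > 0, C >= 0, L >= 1")
    variation_branch = math.sqrt(K) + (C * K) ** (1 / 3)
    switching_branch = math.sqrt(L * K)
    return min(variation_branch, switching_branch)


K = 10**6
# Stationary opponent: C = 0, so the bound is of order sqrt(K).
stationary = enr_bound_order(K, C=0, L=K)
# Worst case: C and L both of order K, so the bound is of order K^(2/3).
worst_case = enr_bound_order(K, C=K, L=K)
# Few switches but large C: the sqrt(L*K) branch is the smaller one.
few_switches = enr_bound_order(K, C=K, L=4)
```

With $K = 10^6$, the stationary case is of order $\sqrt{K} = 10^3$, while the worst case is dominated by $K^{2/3} = 10^4$.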

Abstract

We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even when the opponent follows a fixed policy, so that $O(\sqrt{K})$ external regret is well known to be achievable, their result remains the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
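To see why tuning $\eta$ in the base bound yields the $(CK)^{1/3}$ term, the following order-level calculation may help. It is a sketch that drops constants, logarithmic factors, and the dependence on $|S|$ and $|A|$, and it assumes $C$ is known, which the meta-algorithm is designed to avoid. The cap $\eta \le 1/H$ is taken from the range stated in Theorem 3.

```latex
\begin{align*}
f(\eta) &= \eta C + \sqrt{K/\eta},
&
f'(\eta) &= C - \tfrac{1}{2}\sqrt{K}\,\eta^{-3/2} = 0
\;\Longrightarrow\;
\eta^\star = \Bigl(\tfrac{K}{4C^2}\Bigr)^{1/3}, \\
\eta^\star C &= 4^{-1/3}\,(CK)^{1/3},
&
\sqrt{K/\eta^\star} &= 2^{1/3}\,(CK)^{1/3},
\qquad\text{so}\quad f(\eta^\star) = \Theta\bigl((CK)^{1/3}\bigr). \\
\intertext{If $C$ is small enough that $\eta^\star > 1/H$, the choice is capped at $\eta = 1/H$, giving}
f(1/H) &= C/H + \sqrt{HK}
\;\le\; \tfrac{1}{2}\sqrt{HK} + \sqrt{HK},
&
&\text{since } \eta^\star > 1/H \iff C < \tfrac{1}{2}\sqrt{K}\,H^{3/2}.
\end{align*}
```

The capped case accounts for the $\sqrt{K}$ term (up to factors of $H$), and the uncapped case for the $(CK)^{1/3}$ term; with $C = \Theta(K)$ the latter becomes $K^{2/3}$.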

Paper Structure (32 sections, 25 theorems, 107 equations, 2 algorithms)

Introduction
Contributions.
Related work.
Preliminaries
Base Algorithm: Epoch V-learning and Analysis
Epoch schedule.
Optimistic Nash-value estimate.
Adversarial bandit subroutine.
Main Results for Epoch V-Learning
Proof Sketch of thm:reg_bound_evol
Bounding $\sum_k (2)_{h}^k$.
Bounding $\sum_k (1)_{h}^k$.
Adapting to Unknown Non-Stationarity: A Meta-Algorithm
Algorithm and Main Results
High-level ideas.
...and 17 more sections

Key Result

Theorem 3

Suppose that the opponent is oblivious and $K \geq H|S|$. If we run alg:epoch_V_ol with $\eta \in \left[|S|/K, 1/H\right]$ and the adversarial bandit subroutine is instantiated with ex:adv_subroutine (so $\iota=\Theta(H^2|A|\log(HK|A||S|/\delta))$), then with probability at least $1-\delta$, the est

Theorems & Definitions (49)

Example 2
Theorem 3
Corollary 4: External regret under a stationary opponent
Lemma 5
Lemma 6
Theorem 7
Remark 8
Lemma 9: Restatement of lem:bound_num_epoch
proof
Definition 10
...and 39 more
