Two-Player Zero-Sum Games with Bandit Feedback

Elif Yılmaz, Christos Dimitrakakis

TL;DR

This work addresses learning pure Nash equilibria (NE) in finite two-player zero-sum games (TPZSGs) with bandit feedback, where the payoff matrix is unknown. It adapts the Explore-Then-Commit (ETC) paradigm to TPZSGs and introduces three algorithms (ETC-TPZSG, ETC-TPZSG-AE, and ETC-TPZSG-AE-NUE) that balance exploration and commitment to equilibrium strategies. The authors derive instance-dependent regret bounds, including $O(\Delta + \sqrt{T})$ for ETC and $O(\log(T \Delta^2)/\Delta)$ for the elimination-based variants, and validate these results with experiments showing improved performance over baselines like Tsallis-INF. The results demonstrate that ETC-based approaches can effectively learn pure NE in adversarial environments, offering simple, analyzable methods with practical impact for learning in zero-sum games under partial feedback.

Abstract

We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) framework. The first adapts it to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is the derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, a type of analysis that has received limited attention in the literature on zero-sum games. In particular, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting and $O(\log(T \Delta^2)/\Delta)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Our results therefore indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
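
To make the explore-then-commit idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that every action pair is sampled exactly `k` times during exploration, that payoff noise is Gaussian with standard deviation `sigma` (one instance of subgaussian noise), and that the commit step picks the maximin row and minimax column of the estimated matrix. The names `maximin_pair` and `etc_tpzsg` are hypothetical and chosen only for illustration.

```python
import numpy as np


def maximin_pair(A):
    """Return (row, col, is_saddle) for payoff matrix A.

    The row player maximizes and the column player minimizes.
    is_saddle is True when the maximin and minimax values coincide,
    i.e. when (row, col) is a pure Nash equilibrium of A.
    """
    row_mins = A.min(axis=1)   # worst case payoff of each row
    col_maxs = A.max(axis=0)   # worst case loss of each column
    i = int(np.argmax(row_mins))
    j = int(np.argmin(col_maxs))
    return i, j, bool(np.isclose(row_mins[i], col_maxs[j]))


def etc_tpzsg(A, k, T, sigma=1.0, seed=0):
    """Illustrative explore-then-commit loop on a zero-sum game.

    A     : true payoff matrix, used here only to simulate noisy feedback
    k     : exploration samples per action pair
    T     : horizon (total number of rounds)
    sigma : standard deviation of the assumed Gaussian payoff noise
    """
    m, n = A.shape
    explore_rounds = k * m * n
    if explore_rounds > T:
        raise ValueError("exploration budget k * m * n exceeds the horizon T")

    rng = np.random.default_rng(seed)
    A_hat = np.zeros((m, n))
    # Exploration phase: sample every action pair k times and average.
    for i in range(m):
        for j in range(n):
            A_hat[i, j] = rng.normal(A[i, j], sigma, size=k).mean()

    # Commit phase: play the estimated equilibrium pair for the remaining rounds.
    i_hat, j_hat, _ = maximin_pair(A_hat)
    return A_hat, (i_hat, j_hat), T - explore_rounds


if __name__ == "__main__":
    # A 2x2 game whose pure Nash equilibrium is the pair (0, 0).
    A = np.array([[0.5, 0.8],
                  [0.2, 0.9]])
    A_hat, pair, commit_rounds = etc_tpzsg(A, k=100, T=1000, sigma=0.1)
    print("estimated matrix:\n", A_hat)
    print("committed pair:", pair, "for", commit_rounds, "rounds")
```

In this sketch the choice of `k` carries the usual trade-off: a larger `k` makes it less likely that the wrong pair is committed to, but spends more rounds on pairs that are not the equilibrium.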

Paper Structure

This paper contains 20 sections, 7 theorems, 54 equations, 5 figures, 3 algorithms.

Key Result

Theorem 1

The Nash regret of Algorithm alg:etc_game, when interacting with $\sigma$-subgaussian payoffs, is upper bounded as follows: where $k$ is the exploration time per action pair and $N$ is the total number of action pairs.

Figures (5)

  • Figure 1: The expected regret of ETC-TPZSG with $k$ in \ref{eq:explorationtimeestimate} and the upper bound in \ref{eq:regretboundmin}
  • Figure 2: The cumulative regrets of proposed algorithms and Tsallis-INF with large $\Delta$
  • Figure 3: The cumulative regrets of proposed algorithms and Tsallis-INF with small $\Delta$
  • Figure 4: The cumulative regrets of the algorithms for different game matrices with large $\Delta$
  • Figure 5: The cumulative regrets of the algorithms for different game matrices with small $\Delta$

Theorems & Definitions (16)

  • Definition 1
  • Definition 2: Pure Nash Equilibria
  • Definition 3: External regret
  • Definition 4: Nash regret
  • Theorem 1
  • proof : Proof Sketch
  • Definition 5
  • Theorem 2
  • proof : Proof Sketch
  • Theorem 3
  • ...and 6 more