Optimistic Thompson Sampling for No-Regret Learning in Unknown Games

Yingru Li; Liangqi Liu; Wenqiang Pu; Hao Liang; Zhi-Quan Luo

TL;DR

The paper addresses learning in unknown multi-agent games with bandit feedback by introducing Optimistic Thompson Sampling (OTS) within an Optimism-then-NoRegret (OTN) framework. By exploiting observed opponent actions and reward structure, the approach achieves sublinear adversarial regret and, under structured rewards, a regret bound that depends only logarithmically on the size of the action space, mitigating the curse of multi-player. Key contributions are a regret decomposition, information-theoretic bounds stated via the maximum information gain $\gamma_T$, and concrete instantiations such as OTS-Hedge and OTS-RM, which perform strongly on random matrix games and on real-world tasks such as radar anti-jamming and traffic routing. In practice, the methods cut the experimental budget by more than tenfold relative to conventional approaches, and OTN offers a flexible way to combine optimism with no-regret learning in unknown games.
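To make the "RM" half of OTS-RM concrete, the following is a minimal sketch of the standard regret matching rule, which plays each action with probability proportional to its positive cumulative regret. The function names `rm_strategy` and `rm_update` are hypothetical, and feeding a full vector of per-action reward estimates into the update is an assumption about how an optimistic estimator would plug in; it is not taken from the paper's pseudocode.

```python
import numpy as np

def rm_strategy(cum_regret):
    """Regret matching: probabilities proportional to positive cumulative regret."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total <= 0.0:
        # No action has positive regret yet: fall back to the uniform strategy.
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return positive / total

def rm_update(cum_regret, reward_estimate):
    """Add each action's estimated regret relative to the current mixed strategy.

    reward_estimate holds one (assumed, e.g. optimistic) reward estimate per action.
    """
    strategy = rm_strategy(cum_regret)
    expected = strategy @ reward_estimate
    return cum_regret + (reward_estimate - expected)
```

Starting from zero regret, a single update with estimates `[1.0, 0.0]` moves all probability mass to the first action, since only that action has positive regret against the uniform strategy.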

Abstract

This work tackles the complexities of multi-player scenarios in unknown games, where the primary challenge lies in navigating the uncertainty of the environment through bandit feedback alongside strategic decision-making. We introduce Thompson Sampling (TS)-based algorithms that exploit the information of opponents' actions and reward structures, leading to a substantial reduction in experimental budgets, achieving over tenfold improvements compared to conventional approaches. Notably, our algorithms demonstrate that, given specific reward structures, the regret bound depends logarithmically on the total action space, significantly alleviating the curse of multi-player. Furthermore, we unveil the Optimism-then-NoRegret (OTN) framework, a pioneering methodology that seamlessly incorporates our advancements with established algorithms, showcasing its utility in practical scenarios such as traffic routing and radar sensing in the real world.
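The Optimism-then-NoRegret idea can be illustrated with a small simulation: build an optimistic reward estimate for every own action given the opponent's observed action, then hand that estimate to a full-information no-regret rule (Hedge here). This is a sketch under several assumptions that do not come from the paper: an independent Gaussian posterior per matrix entry (which ignores any reward structure), an opponent that plays uniformly at random, and optimism obtained by taking the maximum over `n_draws` posterior samples clipped to $[0, 1]$. All function and parameter names are hypothetical.

```python
import numpy as np

def hedge_weights(cum_reward, eta):
    """Exponential-weights strategy, shifted by the max for numerical stability."""
    w = np.exp(eta * (cum_reward - cum_reward.max()))
    return w / w.sum()

def optimistic_sample(mean, std, n_draws, rng):
    """Assumed optimism rule: max over several posterior draws, clipped to [0, 1]."""
    draws = rng.normal(mean, std, size=(n_draws,) + mean.shape)
    return np.clip(draws.max(axis=0), 0.0, 1.0)

def run_ots_hedge(reward_matrix, horizon=2000, eta=0.1,
                  noise_std=0.1, n_draws=5, seed=0):
    rng = np.random.default_rng(seed)
    n_a, n_b = reward_matrix.shape
    post_mean = np.full((n_a, n_b), 0.5)   # assumed prior N(0.5, 1) per entry
    post_prec = np.ones((n_a, n_b))
    obs_prec = 1.0 / noise_std ** 2
    cum_est = np.zeros(n_a)
    total_reward = 0.0
    for _ in range(horizon):
        strategy = hedge_weights(cum_est, eta)
        a = rng.choice(n_a, p=strategy)
        b = rng.integers(n_b)              # assumed uniformly random opponent
        y = reward_matrix[a, b] + noise_std * rng.normal()
        total_reward += y
        # Optimism step: estimate every own action against the observed b.
        est = optimistic_sample(post_mean[:, b],
                                1.0 / np.sqrt(post_prec[:, b]), n_draws, rng)
        # No-regret step: full-information Hedge update on the estimates.
        cum_est += est
        # Conjugate Gaussian posterior update for the entry that was played.
        post_mean[a, b] = ((post_prec[a, b] * post_mean[a, b] + obs_prec * y)
                           / (post_prec[a, b] + obs_prec))
        post_prec[a, b] += obs_prec
    return hedge_weights(cum_est, eta), total_reward / horizon
```

In this sketch, entries that have never been played keep a wide posterior, so their optimistic estimates stay high and Hedge keeps assigning them probability until they are tried. Swapping `hedge_weights` for a regret matching rule would give the corresponding RM variant of the same loop.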

Paper Structure (69 sections, 20 theorems, 125 equations, 15 figures, 6 tables, 3 algorithms)

Introduction
Related works
Adversarial Bandits.
Learning in Games.
Structure & Opponent Awareness.
Repeated bandit game
Protocol.
Algorithm.
Reward and Performance Metric.
Optimism-then-NoRegret learning
Review of Full Information Feedback
Bandit Feedback
Challenges with TS
Mitigating Strategy: Optimistic Sampling
Regret analysis
...and 54 more sections

Key Result

Proposition 3.1

Given any action $a \in \mathcal{A}$, we can dissect the one-step regret as follows: where:

Figures (15)

Figure 1: Reduction in experimental budgets achieved by our methods, measured as the number of samples each algorithm needs to reach the same average regret ($10^{-1}$) in the $50\times50$ matrix game setup (see \ref{sec:rmg}). IWE, UCB, and OTS stand for Importance Weighted Estimator, Upper Confidence Bound, and Optimistic Thompson Sampling; RM stands for Regret Matching. IWE-RM, UCB-RM, OTS-Hedge, and OTS-RM are our proposed algorithms. The baselines are IWE-Hedge (known as Exp3 [auer2002nonstochastic]; we implement all Exp3 variants [stoltz2005incomplete, kocak2014efficient, lattimore2020bandit] and select the best as the baseline) and UCB-Hedge (known as GP-MW in [sessa2019no]).
Figure 2: Illustration of failure probabilities for TS-RM versus OTS-RM strategies.
Figure 3: Averaged regrets for different opponents in a $50\times50$ matrix game.
Figure 4: Reward histograms of different algorithms against a non-stationary opponent, illustrating the performance disparity between algorithm types. OTS-based and UCB-based algorithms demonstrate significant efficiency, particularly in adapting to the jammer's changing strategies.
Figure 5: Averaged regrets in the anti-jamming problem demonstrate the superior efficiency of OTS-based and UCB-based algorithms over IWE-based ones. Notably, OTS-Hedge requires significantly fewer samples to achieve a comparable level of average regret, illustrating its robustness and adaptability. Specifically, to achieve an average regret of $10^{-2}$, OTS-Hedge requires only $3.7\%$ and $31\%$ of the samples needed by IWE-Hedge and UCB-Hedge, respectively; OTS-RM uses only $0.1\%$ of IWE-RM's, $25\%$ of UCB-RM's, and $53\%$ of OTS-Hedge's samples.
...and 10 more figures

Theorems & Definitions (62)

Proposition 3.1: Regret Decomposition
Proposition 3.2: IWE-Hedge (Exp3) Analysis
Theorem 3.3: Analysis of IWE-RM
Remark 3.4
Theorem 3.5: Efficacy of UCB-Hedge and UCB-RM
Remark 3.6
Example 3.1: Best Response Player
Proposition 3.7: Limitations of TS-RM
Theorem 3.8: Advantages of OTS
Remark 3.9
...and 52 more
