Optimistic Thompson Sampling for No-Regret Learning in Unknown Games
Yingru Li, Liangqi Liu, Wenqiang Pu, Hao Liang, Zhi-Quan Luo
TL;DR
The paper studies learning in unknown multi-agent games under bandit feedback and introduces Optimistic Thompson Sampling (OTS) within an Optimism-then-NoRegret (OTN) framework. By exploiting observed opponent actions and the reward structure, the approach achieves sublinear adversarial regret and, under structured rewards, a regret bound that depends only logarithmically on the size of the total action space, which mitigates the curse of multi-player. Key technical contributions are a regret decomposition and information-theoretic bounds expressed through the maximum information gain $\gamma_T$, together with concrete instantiations such as OTS-Hedge and OTS-RM, which perform strongly on random matrix games and on real-world tasks such as radar anti-jamming and traffic routing. In practice, the methods cut the experimental budget by more than a factor of ten relative to conventional approaches, and OTN offers a flexible way to combine optimism with existing no-regret learners in large multi-agent games.
Abstract
This work tackles the complexities of multi-player scenarios in \emph{unknown games}, where the primary challenge lies in navigating the uncertainty of the environment through bandit feedback alongside strategic decision-making. We introduce Thompson Sampling (TS)-based algorithms that exploit the information of opponents' actions and reward structures, leading to a substantial reduction in experimental budgets -- achieving over tenfold improvements compared to conventional approaches. Notably, our algorithms demonstrate that, given specific reward structures, the regret bound depends logarithmically on the total action space, significantly alleviating the curse of multi-player. Furthermore, we unveil the \emph{Optimism-then-NoRegret} (OTN) framework, a pioneering methodology that seamlessly incorporates our advancements with established algorithms, showcasing its utility in practical scenarios such as traffic routing and radar sensing in the real world.
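To make the idea of combining optimistic posterior sampling with a no-regret update more concrete, the following is a minimal sketch for the row player of a two-player matrix game. It is not the authors' implementation. It assumes an independent Gaussian prior on every payoff entry, Gaussian observation noise with known standard deviation, payoffs in $[0, 1]$, and an optimistic estimate formed as the maximum over several posterior samples. The function name `ots_hedge` and all parameters (`n_samples`, `noise_std`, `lr`) are hypothetical choices made for illustration.

```python
import numpy as np


def ots_hedge(payoff, opponent_actions, n_samples=10, noise_std=0.1, lr=0.1, seed=0):
    """Illustrative optimistic-sampling + Hedge loop for the row player.

    payoff: (n_a, n_b) array of true mean payoffs in [0, 1], unknown to the learner.
    opponent_actions: sequence of column indices played by the opponent.
    Returns the last mixed strategy used and the cumulative expected payoff.
    """
    rng = np.random.default_rng(seed)
    n_a, n_b = payoff.shape

    # Assumption: independent N(0.5, 1) prior on each payoff entry.
    mean = np.full((n_a, n_b), 0.5)
    var = np.ones((n_a, n_b))
    log_w = np.zeros(n_a)  # Hedge log-weights
    total = 0.0

    for b in opponent_actions:
        # Hedge: play from the exponential-weights distribution.
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        a = rng.choice(n_a, p=p)

        # Bandit feedback: noisy payoff of the played pair only.
        r = payoff[a, b] + noise_std * rng.standard_normal()
        total += payoff[a, b]

        # Conjugate Gaussian posterior update for the observed entry.
        prec = 1.0 / var[a, b] + 1.0 / noise_std**2
        mean[a, b] = (mean[a, b] / var[a, b] + r / noise_std**2) / prec
        var[a, b] = 1.0 / prec

        # Optimistic estimate for every own action against the observed
        # opponent action: maximum over n_samples posterior draws.
        draws = mean[:, b] + np.sqrt(var[:, b]) * rng.standard_normal((n_samples, n_a))
        r_opt = np.clip(draws.max(axis=0), 0.0, 1.0)

        # Full-information Hedge update on the optimistic reward vector.
        log_w += lr * r_opt

    return p, total


if __name__ == "__main__":
    payoff = np.array([[0.9, 0.8], [0.1, 0.2]])
    opp = np.random.default_rng(1).integers(0, 2, size=500)
    strategy, total = ots_hedge(payoff, opp)
    print(strategy, total)
```

In this sketch, observing the opponent's action lets the learner update all of its own actions with a full-information rule, while the optimistic estimate keeps unexplored entries attractive until they are sampled. Replacing the final Hedge step with a regret-matching update would give a sketch in the spirit of OTS-RM under the same assumptions.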
