Strategizing against No-regret Learners
Yuan Deng, Jon Schneider, Balasubramanian Sivan
TL;DR
This work studies how to play optimally in repeated two-player bimatrix games when the other player follows a no-regret learning strategy. It shows that, against any no-regret learner, the optimizer (the non-learning player) can secure a total utility of at least $(V - \varepsilon)T - o(T)$ over $T$ rounds, where $V$ is the Stackelberg value of the game. A matching upper bound of $VT + o(T)$ holds in several regimes: constant-sum games, learners with no swap regret, and learners with only two actions. Surprisingly, against mean-based learners with at least three actions, the optimizer can beat the Stackelberg value in some games; the authors give a concrete construction and characterize the asymptotically optimal play against a mean-based learner as the solution to a multi-dimensional control problem. The analysis connects regret, Stackelberg equilibria, and correlated equilibria, and offers a geometric/state-space perspective that reduces the problem to an $N$-dimensional control problem, highlighting both the potential gains and the open challenge of computing the optimal strategy against mean-based learners. Overall, the work extends prior results from auction settings to general games, clarifying when the Stackelberg benchmark is tight and when it can be surpassed.
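To make the Stackelberg value $V$ concrete, here is a minimal Python sketch. It assumes the usual definition: the optimizer commits to a mixed strategy, the learner best-responds, and ties are broken in the optimizer's favor. The 2x2 payoff matrices and the function name `stackelberg_value_two_actions` are hypothetical illustrations, not taken from the paper, and the grid search is only a simple stand-in for an exact linear-programming solution.

```python
from fractions import Fraction

# Hypothetical game (not from the paper). Rows: optimizer actions, columns: learner actions.
U_OPT = [[1, 3], [0, 2]]
U_LEARNER = [[1, 0], [0, 1]]


def stackelberg_value_two_actions(u_opt, u_learner, grid=1000):
    """Grid-search the optimizer's best commitment (p, 1 - p) over its two actions.

    Returns (value, p). The learner best-responds to the commitment, and ties
    among the learner's best responses are broken in the optimizer's favor.
    """
    n = len(u_opt[0])
    best_value, best_p = None, None
    for k in range(grid + 1):
        p = Fraction(k, grid)  # exact arithmetic, so ties are detected reliably
        mix = (p, 1 - p)
        learner_payoffs = [sum(mix[i] * u_learner[i][j] for i in range(2)) for j in range(n)]
        top = max(learner_payoffs)
        # Optimizer's payoff under the most favorable learner best response.
        value = max(
            sum(mix[i] * u_opt[i][j] for i in range(2))
            for j in range(n)
            if learner_payoffs[j] == top
        )
        if best_value is None or value > best_value:
            best_value, best_p = value, p
    return best_value, best_p


value, p = stackelberg_value_two_actions(U_OPT, U_LEARNER)
print(value, p)  # prints "5/2 1/2"
```

In this toy game the optimizer's first action is dominant, so simultaneous play gives it a payoff of 1, whereas committing to the mix (1/2, 1/2) gives 5/2. This gap is why the Stackelberg value, rather than the Nash payoff, is the natural benchmark here.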
Abstract
How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium of the game. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly higher than the Stackelberg equilibrium utility. We provide a characterization of the optimal game-play for the player against a mean-based no-regret learner as a solution to a control problem. When the no-regret learner's strategy also guarantees him a no-swap regret, we show that the player cannot get anything higher than a Stackelberg equilibrium utility.
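The lower-bound claim can be illustrated with a small simulation. The sketch below assumes the learner runs Hedge (multiplicative weights), a standard mean-based no-regret algorithm, and that the optimizer repeats a fixed mixed strategy that makes the learner's preferred action a strict best response. The game matrices, the learning rate, and the use of expected payoffs (which makes the run deterministic) are illustrative assumptions, not details from the paper.

```python
import math

# Hypothetical game (not from the paper). Rows: optimizer actions, columns: learner actions.
# With ties broken in the optimizer's favor, its Stackelberg value is 2.5 at the mix (0.5, 0.5).
U_OPT = [[1.0, 3.0], [0.0, 2.0]]
U_LEARNER = [[1.0, 0.0], [0.0, 1.0]]


def play_against_hedge(mix, rounds):
    """Average optimizer payoff per round when it plays `mix` in every round.

    The learner runs Hedge on its cumulative expected payoffs.
    """
    n = len(U_LEARNER[0])
    eta = math.sqrt(math.log(n) / rounds)  # a common learning-rate choice (assumption)
    cumulative = [0.0] * n  # learner's cumulative expected payoff per action
    total = 0.0
    for _ in range(rounds):
        peak = max(cumulative)  # subtract the max for numerical stability
        weights = [math.exp(eta * (c - peak)) for c in cumulative]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        total += sum(
            mix[i] * probs[j] * U_OPT[i][j] for i in range(2) for j in range(n)
        )
        for j in range(n):
            cumulative[j] += sum(mix[i] * U_LEARNER[i][j] for i in range(2))
    return total / rounds


# Shifting 0.05 of probability away from the tie point makes the second
# learner action a strict best response, at a cost of 0.05 to the optimizer.
average = play_against_hedge(mix=(0.45, 0.55), rounds=10_000)
print(average)
```

Against the mix (0.45, 0.55) the learner's best response earns the optimizer 2.45 per round, which is $V - 0.05$ in this toy game. The simulated average stays below that figure because Hedge spends early rounds on the other action, and the gap shrinks as `rounds` grows, in line with the $(V - \varepsilon)T - o(T)$ guarantee.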
