Decentralized Online Learning in General-Sum Stackelberg Games
Yaolong Yu, Haipeng Chen
TL;DR
This work studies decentralized online learning in repeated general-sum Stackelberg games under two information regimes for the follower: a limited information setting, in which a myopic follower observes only its own reward, and a side information setting, in which a manipulative follower also has information (omniscient or noisy) about the leader's reward. It develops algorithms for both players and proves last-iterate convergence and sample complexity bounds in both settings. The FBM and FMUCB strategies show that a follower with information about the leader's rewards has an intrinsic advantage: manipulation can outperform best-response dynamics in online settings. Experiments on synthetic games corroborate the theoretical guarantees, showing convergence and gains for the follower under manipulation strategies. Overall, the results clarify how information asymmetry and strategic behavior shape learning dynamics in Stackelberg games, and the algorithms provide online-learning tools for decentralized settings.
Abstract
We study an online learning problem in general-sum Stackelberg games, where players act in a decentralized and strategic manner. We consider two settings, depending on the type of information available to the follower: (1) the limited information setting, where the follower only observes its own reward, and (2) the side information setting, where the follower has extra side information about the leader's reward. We show that, for the follower, myopically best responding to the leader's action is the best strategy in the limited information setting, but not necessarily in the side information setting: there, the follower can manipulate the leader's reward signals with strategic actions and hence induce the leader's strategy to converge to an equilibrium that is more favorable to the follower. Based on these insights, we study decentralized online learning for both players in the two settings. Our main contribution is to derive last-iterate convergence and sample complexity results in both settings. Notably, we design a new manipulation strategy for the follower in the side information setting and show that it has an intrinsic advantage over the best response strategy. Our theoretical results are also supported by empirical results.
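The claim that manipulation can beat best responding can be illustrated on a toy game. The sketch below is not the paper's algorithm: it ignores learning and reward noise, assumes the follower knows both reward tables exactly, and assumes the leader eventually settles on the action that looks best given how the follower responds. The reward values and the helper names (`leader_choice`, `follower_value`, `best_response_map`, `best_manipulation_map`) are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical mean rewards, indexed [leader_action][follower_action].
MU_LEADER = [[0.8, 0.1], [0.5, 0.4]]
MU_FOLLOWER = [[0.3, 0.2], [0.9, 0.8]]


def leader_choice(response):
    """Leader action that looks best when the follower answers action a with response[a]."""
    return max(range(len(MU_LEADER)), key=lambda a: MU_LEADER[a][response[a]])


def follower_value(response):
    """Follower reward at the action the leader settles on under this response map."""
    a = leader_choice(response)
    return MU_FOLLOWER[a][response[a]]


def best_response_map():
    """Myopic follower: maximize its own reward against each leader action."""
    return tuple(max(range(len(row)), key=lambda b: row[b]) for row in MU_FOLLOWER)


def best_manipulation_map():
    """Brute-force search over all response maps for the one the follower prefers."""
    n_follower_actions = len(MU_FOLLOWER[0])
    all_maps = product(range(n_follower_actions), repeat=len(MU_FOLLOWER))
    return max(all_maps, key=follower_value)


br_map = best_response_map()
manip_map = best_manipulation_map()
print("best response:", br_map, "follower reward:", follower_value(br_map))
print("manipulation: ", manip_map, "follower reward:", follower_value(manip_map))
```

In this toy game, best responding leads the leader to action 0, where the follower earns 0.3. By answering action 0 with a response that hurts the leader, the follower makes action 1 look better to the leader and earns 0.9 there. Brute-force enumeration is only feasible for tiny action sets; it is used here to keep the comparison transparent.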
