Learning to Play Multi-Follower Bayesian Stackelberg Games
Gerson Personnat, Tao Lin, Safwan Hossain, David C. Parkes
TL;DR
The paper studies online learning for a leader in a multi-follower Bayesian Stackelberg game with an unknown type distribution. It considers two feedback models, type feedback and action feedback, and establishes sublinear regret guarantees under both. A core contribution is a geometric partition of the leader's strategy space into best-response (BR) regions, within which the leader's expected utility is linear; this enables efficient offline optimization and scalable online learning. The regret bounds depend on whether the followers' types are independent or drawn from a general joint distribution, and a lower bound shows that the type-feedback bounds are tight up to logarithmic factors in key regimes. For action feedback, the authors propose algorithms based on a reduction to linear bandits and on an upper-confidence-bound (UCB) method over BR regions. Collectively, the results show that sublinear regret is achievable even when the joint type space is exponentially large, with a tradeoff between computational complexity and regret that becomes more pronounced as the number of leader actions $L$ grows. The findings have implications for online platforms and security settings where a leader must learn strategic interactions with many private-type followers under limited feedback.
Abstract
In a multi-follower Bayesian Stackelberg game, a leader plays a mixed strategy over $L$ actions to which $n\ge 1$ followers, each having one of $K$ possible private types, best respond. The leader's optimal strategy depends on the distribution of the followers' private types. We study an online learning version of this problem: a leader interacts for $T$ rounds with $n$ followers with types sampled from an unknown distribution every round. The leader's goal is to minimize regret, defined as the difference between the cumulative utility of the optimal strategy and that of the actually chosen strategies. We design learning algorithms for the leader under different feedback settings. Under type feedback, where the leader observes the followers' types after each round, we design algorithms that achieve $\mathcal O\big(\sqrt{\min\{L\log(nKA T), nK \} \cdot T} \big)$ regret for independent type distributions and $\mathcal O\big(\sqrt{\min\{L\log(nKA T), K^n \} \cdot T} \big)$ regret for general type distributions. Interestingly, those bounds do not grow with $n$ at a polynomial rate. Under action feedback, where the leader only observes the followers' actions, we design algorithms with $\mathcal O( \min\{\sqrt{ n^L K^L A^{2L} L T \log T}, K^n\sqrt{ T } \log T \} )$ regret. We also provide a lower bound of $\Omega(\sqrt{\min\{L, nK\}T})$, almost matching the type-feedback upper bounds.
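To make the growth of the type-feedback bounds concrete, the following minimal sketch evaluates the expressions inside the $\mathcal O(\cdot)$ and $\Omega(\cdot)$ above with all hidden constants dropped. The function names and the sample parameter values are illustrative assumptions, not part of the paper, and the printed numbers indicate growth rates only, not actual regret.

```python
import math


def type_feedback_rate(L, n, K, A, T, independent=True):
    """Expression inside the O(.) of the type-feedback bound (constants dropped).

    independent=True  uses min{L log(nKAT), nK}
    independent=False uses min{L log(nKAT), K^n}
    """
    log_term = L * math.log(n * K * A * T)
    type_term = n * K if independent else K ** n
    return math.sqrt(min(log_term, type_term) * T)


def lower_bound_rate(L, n, K, T):
    """Expression inside the Omega(.) lower bound (constants dropped)."""
    return math.sqrt(min(L, n * K) * T)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    L, K, A, T = 5, 3, 4, 10_000
    for n in (1, 10, 100, 1000):
        ind = type_feedback_rate(L, n, K, A, T, independent=True)
        gen = type_feedback_rate(L, n, K, A, T, independent=False)
        low = lower_bound_rate(L, n, K, T)
        print(f"n={n:5d}  independent={ind:9.1f}  general={gen:9.1f}  lower={low:9.1f}")
```

Once $nK$ (or $K^n$) exceeds $L\log(nKAT)$, the minimum is attained by the logarithmic term, so increasing $n$ further changes the rate only through $\log n$.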
