Learning to Play Multi-Follower Bayesian Stackelberg Games

Gerson Personnat, Tao Lin, Safwan Hossain, David C. Parkes

TL;DR

The paper studies online learning for a leader in a multi-follower Bayesian Stackelberg game where the followers' type distribution is unknown. It considers two feedback models (type feedback and action feedback) and establishes sublinear regret guarantees under both. A core contribution is a geometric partition of the leader's strategy space into best-response (BR) regions; within each region the leader's expected utility is linear, which enables efficient offline optimization and online learning over regions. The regret bounds depend on whether follower types are independent or generally distributed, and a lower bound shows that the type-feedback bounds are tight up to logarithmic factors in key regimes. For action feedback, the authors propose algorithms based on a linear-bandit reduction and on UCB over BR regions. Together, the results show that sublinear regret is achievable even when the joint type space is exponentially large, with a tradeoff between computational complexity and regret as the number of leader actions $L$ grows. The findings are relevant to online platforms and security settings in which a leader must learn to interact strategically with many privately typed followers under limited feedback.

Abstract

In a multi-follower Bayesian Stackelberg game, a leader plays a mixed strategy over $L$ actions to which $n\ge 1$ followers, each having one of $K$ possible private types, best respond. The leader's optimal strategy depends on the distribution of the followers' private types. We study an online learning version of this problem: a leader interacts for $T$ rounds with $n$ followers with types sampled from an unknown distribution every round. The leader's goal is to minimize regret, defined as the difference between the cumulative utility of the optimal strategy and that of the actually chosen strategies. We design learning algorithms for the leader under different feedback settings. Under type feedback, where the leader observes the followers' types after each round, we design algorithms that achieve $\mathcal O\big(\sqrt{\min\{L\log(nKA T), nK \} \cdot T} \big)$ regret for independent type distributions and $\mathcal O\big(\sqrt{\min\{L\log(nKA T), K^n \} \cdot T} \big)$ regret for general type distributions. Interestingly, those bounds do not grow with $n$ at a polynomial rate. Under action feedback, where the leader only observes the followers' actions, we design algorithms with $\mathcal O( \min\{\sqrt{ n^L K^L A^{2L} L T \log T}, K^n\sqrt{ T } \log T \} )$ regret. We also provide a lower bound of $\Omega(\sqrt{\min\{L, nK\}T})$, almost matching the type-feedback upper bounds.
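
The verbal definition of regret in the abstract can be written out as a formula. The following is one natural formalization, not necessarily the paper's exact statement: the symbols $x^\ast$ (an optimal leader strategy) and $x^t$ (the strategy played in round $t$) are notation assumed here, and the paper may additionally take an expectation over the algorithm's randomness and the sampled types.

$$
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big( U_{\boldsymbol{\mathcal{D}}}(x^\ast) - U_{\boldsymbol{\mathcal{D}}}(x^t) \Big),
\qquad
x^\ast \in \arg\max_{x \in \Delta(\mathcal{L})} U_{\boldsymbol{\mathcal{D}}}(x).
$$

Under this reading, a bound of order $\sqrt{T}$ means the average per-round loss $\mathrm{Reg}(T)/T$ vanishes at rate $1/\sqrt{T}$.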

Paper Structure

This paper contains 41 sections, 21 theorems, 84 equations, 3 figures, 1 table, 5 algorithms.

Key Result

Lemma 3.1

For each $W$, the leader's expected utility function $U_{ \boldsymbol{\mathcal{D}} }(x)$ is linear in $x \in R(W)$.
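
To make the lemma concrete, here is a minimal Python sketch for the single-follower special case. Everything in it is an illustrative assumption rather than the paper's construction: the payoff tables, the type probabilities, the tie-breaking rule (lowest action index), and the helper names `best_response`, `br_pattern`, `leader_utility`, and `midpoint`. It checks numerically that the leader's expected utility is affine along a segment whose points share one best-response pattern, and that this fails for a segment crossing region boundaries.

```python
# Hypothetical instance: L = 2 leader actions, A = 2 follower actions,
# K = 2 follower types. All numbers are made up for illustration.
FOLLOWER_UTILITY = [
    [[1.0, 0.0], [0.0, 1.0]],  # type 0, indexed as v[l][a]
    [[3.0, 0.0], [0.0, 1.0]],  # type 1, indexed as v[l][a]
]
LEADER_UTILITY = [[1.0, 0.0], [0.5, 2.0]]  # indexed as u[l][a]
TYPE_PROBS = [0.5, 0.5]


def best_response(x, v):
    """Follower action maximizing expected utility against mixed strategy x."""
    num_actions = len(v[0])
    payoffs = [
        sum(x[l] * v[l][a] for l in range(len(x))) for a in range(num_actions)
    ]
    # max() returns the first maximizer, so ties go to the lowest index.
    return max(range(num_actions), key=lambda a: payoffs[a])


def br_pattern(x):
    """Best-response action of every type; constant inside one region."""
    return tuple(best_response(x, v) for v in FOLLOWER_UTILITY)


def leader_utility(x):
    """Leader's expected utility over the type distribution."""
    return sum(
        p * sum(x[l] * LEADER_UTILITY[l][a] for l in range(len(x)))
        for p, a in zip(TYPE_PROBS, br_pattern(x))
    )


def midpoint(x, y):
    return [(a + b) / 2 for a, b in zip(x, y)]


# Two strategies in the same region: utility at the midpoint equals the average.
x1, x2 = [0.3, 0.7], [0.4, 0.6]
same_region = br_pattern(x1) == br_pattern(x2) == br_pattern(midpoint(x1, x2))
gap_inside = abs(
    leader_utility(midpoint(x1, x2)) - (leader_utility(x1) + leader_utility(x2)) / 2
)

# Two strategies in different regions: the same identity no longer holds.
x3, x4 = [0.1, 0.9], [0.6, 0.4]
gap_across = abs(
    leader_utility(midpoint(x3, x4)) - (leader_utility(x3) + leader_utility(x4)) / 2
)

print(same_region, gap_inside, gap_across)
```

In this toy instance the follower's best responses change at $x_1 = 0.25$ and $x_1 = 0.5$ (writing $x_1$ for the probability of the first leader action), so the simplex splits into three regions; the midpoint check is only a numerical spot check of linearity, not a proof.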

Figures (3)

  • Figure 1: Best-response regions for a single follower with $K = 3$ types, $A = 2$ follower actions, and $L = 3$ leader actions. The triangle represents the probability simplex $\Delta(\mathcal{L})$. The three hyperplanes defined by $d_{1}(0, 1)$, $d_{2}(1, 0)$ and $d_3(1, 0)$ partition the simplex into best-response regions. For example, in region $R(w_{0,1,1})$, the follower best-responds with action $0$ when its type is $1$, and with action $1$ when its type is $2$ or $3$.
  • Figure 2: Cumulative regret of the type-feedback Algorithms \ref{alg:full-feedback_general} and \ref{alg:full-feedback} for an $(L=2, K=6, A=2, n=2)$ instance with independent types. We plot the average over 2000 simulations with 90% confidence intervals.
  • Figure 3: Cumulative regret of Algorithm \ref{alg:stackelberg-linear-bandit} (the Linear-Bandit approach inspired by bernasconi_optimal_2023) and Algorithm \ref{alg:action-feedback-ucb-algorithm} for an $(L=2, K=6, A=2, n=2)$ instance. We plot the average over 2000 simulations with 90% confidence intervals.

Theorems & Definitions (43)

  • Definition 2.1: Followers' Best Response
  • Definition 2.2: Leader's Optimal Strategy
  • Definition 2.3
  • Definition 3.1: Best-Response Region
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Lemma 4.1
  • Theorem 4.1
  • Theorem 4.2
  • ...and 33 more