Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information

Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Keegan Harris, Zhiwei Steven Wu

TL;DR

This work studies online learning in Stackelberg games with side information under bandit feedback, focusing on the leader's regret. It introduces a reduction to linear contextual bandits in the leader's utility space, which yields $\tilde{O}(T^{1/2})$ regret and extends to unknown leader utilities, bidding in second-price auctions, and online Bayesian persuasion. The regret bounds improve on the previously best-known $O(T^{2/3})$ rate, and numerical simulations show the algorithms outperforming prior methods.

Abstract

We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after which the follower best-responds. We provide learning algorithms for the leader which achieve $O(T^{1/2})$ regret under bandit feedback, an improvement from the previously best-known rates of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: In each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader's mixed strategy. We extend our algorithms to the setting in which the leader's utility function is unknown, and also apply them to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe that our algorithms empirically outperform previous results on numerical simulations.
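The per-round interaction described in the abstract (the leader observes a context, commits to a mixed strategy, and the follower best-responds) can be illustrated with a minimal Python sketch. The utility signature `u(context, leader_action, follower_action)`, the function names, and the lowest-index tie-breaking rule are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: utility(context, leader_action, follower_action) -> float
Utility = Callable[[Sequence[float], int, int], float]


def expected_utility(u: Utility, z: Sequence[float], x: Sequence[float], a_f: int) -> float:
    """Expected utility under the leader's mixed strategy x when the follower plays a_f."""
    return sum(p * u(z, a_l, a_f) for a_l, p in enumerate(x))


def best_response(u_follower: Utility, z: Sequence[float], x: Sequence[float],
                  n_follower_actions: int) -> int:
    """Follower action maximizing the follower's expected utility against x.

    Ties go to the lowest action index (an assumption of this sketch).
    """
    return max(range(n_follower_actions),
               key=lambda a_f: expected_utility(u_follower, z, x, a_f))


def play_round(u_leader: Utility, u_follower: Utility, z: Sequence[float],
               x: Sequence[float], n_follower_actions: int) -> Tuple[int, float]:
    """One round: the follower best-responds to x; return (follower action, leader utility)."""
    a_f = best_response(u_follower, z, x, n_follower_actions)
    return a_f, expected_utility(u_leader, z, x, a_f)
```

Under bandit feedback the leader would observe only the outcome of the round rather than the follower's utility function, which is why a learning algorithm is needed to choose `x`.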

Paper Structure

This paper contains 23 sections, 14 theorems, 35 equations, 1 figure, 3 algorithms.

Key Result

Theorem 3.1

When $\mathcal{R}$ is instantiated as the OFUL algorithm of Abbasi-Yadkori et al. (2011), the meta-algorithm (alg:meta) obtains expected contextual Stackelberg regret of order $T^{1/2}$ in the horizon $T$, up to logarithmic factors, when the sequence of contexts is chosen adversarially and the sequence of follower types is chosen stochastically.
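To make the role of $\mathcal{R}$ more concrete, the following is a minimal Python sketch of an optimistic linear bandit in the spirit of OFUL: a ridge-regression estimate plus an exploration bonus. It is a simplification, not the paper's meta-algorithm: the class name, the finite candidate set, and the fixed confidence width `beta` are assumptions made here (OFUL derives its confidence radius from theory and optimizes over a general decision set).

```python
import math


class OptimisticLinearBandit:
    """Ridge-regression estimate plus an optimism bonus over a finite candidate set."""

    def __init__(self, dim: int, reg: float = 1.0, beta: float = 1.0):
        self.dim = dim
        self.beta = beta  # confidence width; a fixed constant here (assumption)
        # Inverse of V = reg * I + sum of x x^T, kept up to date incrementally.
        self.v_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _matvec(self, m, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in m]

    def theta_hat(self):
        """Ridge estimate V^{-1} b of the unknown parameter vector."""
        return self._matvec(self.v_inv, self.b)

    def select(self, candidates):
        """Index of the candidate with the largest upper confidence bound."""
        theta = self.theta_hat()

        def ucb(x):
            vx = self._matvec(self.v_inv, x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(max(0.0, sum(xi * vi for xi, vi in zip(x, vx))))
            return mean + self.beta * width

        return max(range(len(candidates)), key=lambda i: ucb(candidates[i]))

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of V^{-1}, then accumulate reward * x."""
        vx = self._matvec(self.v_inv, x)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, vx))
        for i in range(self.dim):
            for j in range(self.dim):
                self.v_inv[i][j] -= vx[i] * vx[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

In the reduction described in the abstract, the candidates would play the role of utility vectors, and the selected vector would then be inverted into a mixed strategy for the leader; that inversion step is not shown here.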

Figures (1)

  • Figure 1: Empirical Results

Theorems & Definitions (26)

  • Definition 2.1: Contextual Stackelberg Regret
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.4
  • Corollary 3.5
  • Corollary 3.6
  • Definition 3.7: Contextual Follower Best-Response Region
  • Definition 3.8: Contextual Best-Response Region
  • Definition 3.9: $\delta$-approximate extreme points
  • Definition 3.10: Effective follower types
  • ...and 16 more