Regret Minimization in Stackelberg Games with Side Information

Keegan Harris, Zhiwei Steven Wu, Maria-Florina Balcan

TL;DR

This work extends online learning in Stackelberg games to settings with side information by formalizing Stackelberg games with context and follower types. It proves a fundamental impossibility result: no-regret learning is unattainable when both contexts and follower types are chosen adversarially. To enable learning, the paper introduces a discretization of the leader policy space into finite, context-dependent sets and analyzes two natural relaxations: stochastic followers with adversarial contexts, and stochastic contexts with adversarial followers, providing regret guarantees via greedy estimation and Hedge over policies. It further extends to bandit feedback using barycentric spanners to construct low-variance estimators, achieving regret on the order of $\tilde{O}(T^{2/3})$ in the bandit setting. In simulations, the proposed methods outperform non-contextual baselines, suggesting practical value for security, wildlife protection, and related applications where side information is available.
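As a rough illustration of what "Hedge over policies" means, the following is a minimal sketch of the standard full-feedback Hedge update over a finite set of candidate policies. It is not the paper's algorithm: the function names (`hedge`, `_distribution`), the reward table `reward_rows`, and the learning rate `eta` are assumptions introduced here, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def _distribution(cumulative, eta):
    """Exponential-weights distribution; shifted by the max for numerical stability."""
    top = max(cumulative)
    weights = [math.exp(eta * (c - top)) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def hedge(reward_rows, eta, seed=0):
    """Generic Hedge with full feedback (illustrative, not the paper's exact method).

    reward_rows[t][k] is the hypothetical reward in [0, 1] that candidate
    policy k would have earned in round t. Returns the sampled policy index
    for each round and the final distribution over policies.
    """
    rng = random.Random(seed)
    num_policies = len(reward_rows[0])
    cumulative = [0.0] * num_policies
    choices = []
    for rewards in reward_rows:
        probs = _distribution(cumulative, eta)
        # Sample a policy before this round's rewards are revealed.
        choices.append(rng.choices(range(num_policies), weights=probs)[0])
        # Full feedback: every policy's reward is observed and accumulated.
        for k, r in enumerate(rewards):
            cumulative[k] += r
    return choices, _distribution(cumulative, eta)


# Toy usage: policy 1 is always best, so the weight concentrates on it.
T, K = 200, 3
rows = [[0.0, 1.0, 0.0] for _ in range(T)]
choices, final_probs = hedge(rows, eta=math.sqrt(8 * math.log(K) / T))
```

In this toy run the final distribution places almost all of its mass on the consistently best policy. In the contextual setting each "expert" would be a context-dependent leader policy drawn from the finite discretized set rather than a fixed action.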

Abstract

Algorithms for playing in Stackelberg games have been deployed in real-world domains including airport security, anti-poaching efforts, and cyber-crime prevention. However, these algorithms often fail to take into consideration the additional information available to each player (e.g. traffic patterns, weather conditions, network congestion), which may significantly affect both players' optimal strategies. We formalize such settings as Stackelberg games with side information, in which both players observe an external context before playing. The leader commits to a (context-dependent) strategy, and the follower best-responds to both the leader's strategy and the context. We focus on the online setting in which a sequence of followers arrives over time, and the context may change from round to round. In sharp contrast to the non-contextual version, we show that it is impossible for the leader to achieve no-regret in the full adversarial setting. Motivated by this result, we show that no-regret learning is possible in two natural relaxations: the setting in which the sequence of followers is chosen stochastically and the sequence of contexts is adversarial, and the setting in which contexts are stochastic and follower types are adversarial.
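To make the interaction concrete, here is a minimal sketch of a follower best-responding to a committed leader mixed strategy and an observed context. Everything specific is an assumption for illustration: the function `follower_best_response`, the toy utility `toy_utility`, and the two-target security example are hypothetical, and ties are broken by lowest action index, which may differ from the paper's tie-breaking rule.

```python
def follower_best_response(context, leader_strategy, follower_utility, num_follower_actions):
    """Return the follower action maximizing expected utility.

    leader_strategy[a_l] is the probability the leader plays action a_l.
    follower_utility(context, a_l, a_f) is a hypothetical utility function.
    Ties go to the lowest action index (an assumption of this sketch).
    """
    def expected_utility(a_f):
        return sum(
            p * follower_utility(context, a_l, a_f)
            for a_l, p in enumerate(leader_strategy)
        )

    return max(range(num_follower_actions), key=expected_utility)


def toy_utility(context, guarded_target, attacked_target):
    """Toy security game: the context gives each target's value to the attacker."""
    if guarded_target == attacked_target:
        return -1.0  # attacker is caught
    return context[attacked_target]


context = (1.0, 3.0)  # target 1 is three times as valuable in this context
uniform_guard = follower_best_response(context, (0.5, 0.5), toy_utility, 2)
skewed_guard = follower_best_response(context, (0.1, 0.9), toy_utility, 2)
```

Under uniform coverage the attacker prefers the more valuable target 1; once the leader shifts most of its coverage to target 1, the attacker switches to target 0. A different context would change the target values and therefore the coverage the leader should commit to, which is why the leader's strategy is context-dependent.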

Paper Structure

This paper contains 18 sections, 25 theorems, 57 equations, 2 figures, 1 table, 4 algorithms.

Key Result

Lemma 3.1

Any algorithm suffers regret $R_{\mathrm{OLT}}(T) = \Omega(T)$ in the online linear thresholding problem when the sequence of points $\omega_1, \ldots, \omega_T$ is chosen by an adversary.

Figures (2)

  • Figure 1: Summary of our reduction from the online linear thresholding problem. At time $t \in [T]$, (1.) the learner observes a point $\omega_t$, (2.) the learner takes a guess $g_t$, and (3.) the learner observes the true label $y_t$. Given a regret minimizer for our setting, we show how to use it in a black-box way (by constructing functions $h_1$, $h_2$, $h_3$) to achieve no-regret in the online linear thresholding problem.
  • Figure 2: Cumulative average reward of Algorithm 1, Algorithm 2, and the algorithm of Balcan et al. (2015) (which does not take side information into consideration) over five runs in a synthetic data setup. Shaded regions represent one standard deviation.

Theorems & Definitions (50)

  • Definition 2.1: Follower Best-Response
  • Definition 2.2: Optimal Policy
  • Definition 2.3: Contextual Stackelberg Regret
  • Lemma 3.1
  • Theorem 3.2
  • Definition 4.1: Contextual Follower Best-Response Region
  • Definition 4.2: Contextual Best-Response Region
  • Definition 4.3: $\delta$-approximate extreme points
  • Lemma 4.3
  • Definition 4.4: Expected Contextual Stackelberg Regret
  • ...and 40 more