Regret Minimization in Stackelberg Games with Side Information

Keegan Harris; Zhiwei Steven Wu; Maria-Florina Balcan

TL;DR

This work extends online learning in Stackelberg games to settings with side information by formalizing Stackelberg games with contexts and follower types. It proves an impossibility result: no-regret learning is unattainable when both contexts and follower types are chosen adversarially. To enable learning, the paper discretizes the leader's policy space into finite, context-dependent sets and analyzes two natural relaxations: stochastic followers with adversarial contexts, and stochastic contexts with adversarial followers. It provides regret guarantees for these settings via greedy estimation and Hedge over policies. The results extend to bandit feedback, where barycentric spanners are used to construct low-variance estimators, giving regret of order $\tilde{O}(T^{2/3})$. In simulations, the proposed methods outperform non-contextual baselines, which suggests practical relevance for security, wildlife protection, and related applications where side information is available.
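To make "Hedge over policies" concrete, here is a minimal sketch of the generic Hedge (exponential weights) update over a finite set of candidate policies. It is not the paper's algorithm: the function name `hedge`, the learning rate `eta`, and the `reward_matrix` input (the reward in [0, 1] that each policy would have earned in each round, assumed fully observed) are all illustrative assumptions.

```python
import numpy as np

def hedge(reward_matrix, eta):
    """Generic Hedge over a finite policy set (illustrative sketch).

    reward_matrix[t, i] is the reward policy i would have earned in round t
    (hypothetical full-feedback input). Returns the distribution over
    policies used in each round, one row per round.
    """
    num_rounds, num_policies = reward_matrix.shape
    cumulative = np.zeros(num_policies)
    distributions = []
    for t in range(num_rounds):
        # Exponential weights on cumulative reward; subtract the max for stability.
        logits = eta * cumulative
        logits = logits - logits.max()
        probs = np.exp(logits)
        probs = probs / probs.sum()
        distributions.append(probs)
        # Full feedback: every policy's reward for this round is revealed.
        cumulative = cumulative + reward_matrix[t]
    return np.array(distributions)

# Toy usage: policy 0 always earns reward 1, policy 1 always earns 0.
rewards = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
dists = hedge(rewards, eta=1.0)
```

The first row of `dists` is uniform, and later rows place increasing weight on the policy that has performed better so far.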

Abstract

Algorithms for playing in Stackelberg games have been deployed in real-world domains including airport security, anti-poaching efforts, and cyber-crime prevention. However, these algorithms often fail to take into consideration the additional information available to each player (e.g., traffic patterns, weather conditions, network congestion), which may significantly affect both players' optimal strategies. We formalize such settings as Stackelberg games with side information, in which both players observe an external context before playing. The leader commits to a (context-dependent) strategy, and the follower best-responds to both the leader's strategy and the context. We focus on the online setting in which a sequence of followers arrives over time, and the context may change from round to round. In sharp contrast to the non-contextual version, we show that it is impossible for the leader to achieve no-regret in the fully adversarial setting. Motivated by this result, we show that no-regret learning is possible in two natural relaxations: the setting in which the sequence of followers is chosen stochastically and the sequence of contexts is adversarial, and the setting in which contexts are stochastic and follower types are adversarial.
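As a rough illustration of a follower best-responding to both the leader's strategy and the context, the sketch below uses a two-target security game. The payoff numbers, the scalar context, and the helpers `payoffs_for_context` and `follower_best_response` are hypothetical and chosen only for this example; the paper's formal definitions may differ.

```python
import numpy as np

def payoffs_for_context(context):
    """Hypothetical follower payoffs, indexed [leader_action, follower_action].

    The leader action is the target defended, the follower action is the
    target attacked. Attacking an undefended target pays 1; the context
    (a scalar here) adds a bonus for attacking target 1.
    """
    base = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
    bonus = np.array([[0.0, context],
                      [0.0, context]])
    return base + bonus

def follower_best_response(leader_strategy, follower_payoffs):
    """Return the follower action maximizing expected payoff against a mixed strategy."""
    expected = leader_strategy @ follower_payoffs  # one entry per follower action
    return int(np.argmax(expected))

# The leader defends target 0 with probability 0.4 and target 1 with probability 0.6.
x = np.array([0.4, 0.6])
br_low = follower_best_response(x, payoffs_for_context(0.0))   # attacks target 0
br_high = follower_best_response(x, payoffs_for_context(0.5))  # attacks target 1
```

The same leader strategy induces different follower actions under the two contexts, which is why a context-independent commitment can be suboptimal.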

Paper Structure

This paper contains 18 sections, 25 theorems, 57 equations, 2 figures, 1 table, and 4 algorithms.

Introduction
Setting and background
On the impossibility of fully adversarial no-regret learning
Limiting the power of the adversary
Stochastic follower types and adversarial contexts
Stochastic contexts and adversarial follower types
Simulations
Extension to bandit feedback
Conclusion
Appendix: On the impossibility of fully adversarial no-regret learning
Appendix: Limiting the power of the adversary
Appendix: Stochastic follower types and adversarial contexts
Appendix: Stochastic contexts and adversarial follower types
Appendix: Extension to bandit feedback
Stochastic follower types and adversarial contexts
...and 3 more sections

Key Result

Lemma 3.1

Any algorithm suffers regret $R_{\mathrm{OLT}}(T) = \Omega(T)$ in the online linear thresholding problem when the sequence of points $\omega_1, \ldots, \omega_T$ is chosen by an adversary.
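For readers less familiar with the notation, the lemma can be read against the usual no-regret criterion. The display below is standard asymptotic notation, not a statement taken from the paper; the constant $c$ and the threshold $T_0$ are introduced here only for illustration.

```latex
\text{No-regret:}\quad R(T) = o(T)
  \iff \lim_{T \to \infty} \frac{R(T)}{T} = 0,
\qquad
\text{Linear regret:}\quad R_{\mathrm{OLT}}(T) = \Omega(T)
  \iff \exists\, c > 0,\ T_0 \ \text{such that}\ R_{\mathrm{OLT}}(T) \ge cT \ \ \forall\, T \ge T_0 .
```

Under this reading, linear regret rules out the no-regret property, since $R_{\mathrm{OLT}}(T)/T$ stays bounded away from zero.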

Figures (2)

Figure 1: Summary of our reduction from the online linear thresholding problem. At time $t \in [T]$, (1.) the learner observes a point $\omega_t$, (2.) the learner takes a guess $g_t$, and (3.) the learner observes the true label $y_t$. Given a regret minimizer for our setting, we show how to use it in a black-box way (by constructing functions $h_1$, $h_2$, $h_3$) to achieve no-regret in the online linear thresholding problem.
Figure 2: Cumulative average reward of Algorithm 1, Algorithm 2, and the algorithm of Balcan et al. (2015) (which does not take side information into consideration) over five runs in a synthetic data setup. Shaded regions represent one standard deviation.

Theorems & Definitions (50)

Definition 2.1: Follower Best-Response
Definition 2.2: Optimal Policy
Definition 2.3: Contextual Stackelberg Regret
Lemma 3.1
Theorem 3.2
Definition 4.1: Contextual Follower Best-Response Region
Definition 4.2: Contextual Best-Response Region
Definition 4.3: $\delta$-approximate extreme points
Lemma 4.3
Definition 4.4: Expected Contextual Stackelberg Regret
...and 40 more
