Learning in Stackelberg Games with Non-myopic Agents

Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alexander Wei

TL;DR

The main question of this work is: what are principled approaches to learning against non-myopic agents in general Stackelberg games?

Abstract

We study Stackelberg games where a principal repeatedly interacts with a non-myopic long-lived agent, without knowing the agent's payoff function. Although learning in Stackelberg games is well understood when the agent is myopic, dealing with non-myopic agents poses additional complications. In particular, non-myopic agents may strategize and select actions that are inferior in the present in order to mislead the principal's learning algorithm and obtain better outcomes in the future. We provide a general framework that reduces learning in the presence of non-myopic agents to robust bandit optimization in the presence of myopic agents. Through the design and analysis of minimally reactive bandit algorithms, our reduction trades off the statistical efficiency of the principal's learning algorithm against its effectiveness in inducing near-best-responses. We apply this framework to Stackelberg security games (SSGs), pricing with an unknown demand curve, general finite Stackelberg games, and strategic classification. In each setting, we characterize the type and impact of misspecifications present in near-best responses and develop a learning algorithm robust to such misspecifications. Along the way, we improve the state-of-the-art query complexity of learning in SSGs with $n$ targets from $O(n^3)$ to a near-optimal $\widetilde{O}(n)$ by uncovering a fundamental structural property of these games. The latter result is of independent interest beyond learning with non-myopic agents.

Paper Structure (74 sections, 39 theorems, 53 equations, 8 figures, 1 table, 15 algorithms)

Key Result

Proposition 2.1

Let $0 < \gamma < 1$ and $\varepsilon \geq 0$. Fix $D = \lceil T_\gamma \log (T_\gamma/\varepsilon) \rceil$, where $T_\gamma = \frac{1}{1-\gamma}$ is the agent's discounted time horizon. Then, if principal policy $\mathcal{A}$ is $D$-delayed, we have $R_\mathcal{A}(T,\gamma) \leq R_\mathcal{A}^\vare
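The delay in Proposition 2.1 can be computed directly from the two stated formulas. The sketch below is illustrative only: the function names are hypothetical, the logarithm is assumed to be natural, and $\varepsilon$ is required to be strictly positive because the expression diverges at $\varepsilon = 0$. The helper `discounted_tail` reflects one possible reading of the choice of $D$ (not a statement taken from the paper): with this delay, the total discounted weight $\gamma^D/(1-\gamma)$ that the agent places on rounds at least $D$ steps ahead is at most $\varepsilon$.

```python
import math


def discounted_horizon(gamma: float) -> float:
    """T_gamma = 1 / (1 - gamma), the agent's discounted time horizon."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return 1.0 / (1.0 - gamma)


def delay(gamma: float, eps: float) -> int:
    """D = ceil(T_gamma * log(T_gamma / eps)), assuming a natural logarithm."""
    if eps <= 0.0:
        raise ValueError("eps must be positive for the logarithm to be finite")
    t_gamma = discounted_horizon(gamma)
    return math.ceil(t_gamma * math.log(t_gamma / eps))


def discounted_tail(gamma: float, d: int) -> float:
    """Sum of gamma**t over all t >= d, i.e. gamma**d / (1 - gamma)."""
    return gamma ** d / (1.0 - gamma)


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99):
        d = delay(gamma, eps=0.1)
        print(gamma, d, discounted_tail(gamma, d))
```

The required delay grows roughly like $T_\gamma \log T_\gamma$ as $\gamma$ approaches 1, so more patient agents call for longer delays.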

Figures (8)

  • Figure 1: Agent utility profiles for a wasteful principal strategy $\mathbf{x}^w$ and a conservative strategy $\mathbf{x}^c$ for a 3-target SSG. The strategy $\mathbf{x}^w$ is wasteful because it allocates non-zero weight to targets 2 and 3, but $\mathsf{BR}(\mathbf{x}^w) = \{1\}$. The strategy $\mathbf{x}^c$ is conservative because $\mathsf{BR}(\mathbf{x}^c) = \{1,3\}$ and it allocates no weight to target 2.
  • Figure 2: The principal strategy space $\mathcal{X} = \{ (x_1,x_2)\in[0,1]^2 : x_1^2 + x_2^2 \leq 1 \}$ for a two-target game with non-convex best response regions $K_1$ and $K_2$, induced by agent payoffs $v^1(x_1) = 1 - \frac{3}{4}(1+e^{5-15x_1})^{-1}$ and $v^2(x_2) = 1-x_2$. Observe that the unique conservative minimizer $\mathbf{x}^\star$ lies in the intersection $K_1 \cap K_2$.
  • Figure 3: Query complexity of Clinch versus SecuritySearch in two settings. The $y$-axis shows the number of calls the principal must make to the best response oracle. Both sets of axes are displayed on a log-log scale. Dashed lines depict power law fits of the scaling curves, obtained from log-log linear regression.
  • Figure 4: Regret achieved by batched and multi-threaded variants of Clinch against a simulated $\gamma$-discounting agent on a random SSG instance. For each discount factor, we note the optimal batch size $B^\star$ at $T=500$. We note that $B=1$ corresponds to the baseline Clinch algorithm designed for myopic agents.
  • Figure 5: Regret achieved by batched and multi-threaded variants of Clinch against a simulated $\gamma$-discounting agent on four random SSG instances with $n=3$. For each instance and discount factor, we note the optimal batch size $B^\star$ at $T=500$.
  • ...and 3 more figures
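
The power law fits mentioned in the caption of Figure 3 come from linear regression on log-log axes. A minimal sketch of that technique follows; it assumes a model of the form $y = c \cdot n^k$, uses a hypothetical function name, and runs on synthetic data generated for illustration, not on measurements from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**k by ordinary least squares on (log x, log y).

    Returns (c, k). All inputs must be strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent k.
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    k = sxy / sxx
    # The intercept is log c, so exponentiate to recover the prefactor.
    c = math.exp(mean_y - k * mean_x)
    return c, k


if __name__ == "__main__":
    # Synthetic, noise-free data following y = 3 * n**2 (illustrative only).
    ns = [2, 4, 8, 16, 32]
    queries = [3 * n ** 2 for n in ns]
    c, k = fit_power_law(ns, queries)
    print(f"prefactor ~ {c:.3f}, exponent ~ {k:.3f}")
```

The fitted exponent $k$ is the slope of the dashed line on the log-log plot, which is what makes such fits convenient for comparing scaling behavior between two algorithms.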

Theorems & Definitions (89)

  • Proposition 2.1
  • Proposition 2.2
  • Remark 3.1
  • Remark 3.2
  • Definition 3.3
  • Lemma 3.4
  • proof
  • Proposition 3.5
  • proof
  • Remark 3.6
  • ...and 79 more