Learning to Steer Learners in Games

Yizhou Zhang, Yi-An Ma, Eric Mazumdar

TL;DR

This work analyzes how an optimizer can steer a no-regret learner toward a Stackelberg equilibrium in repeated two-player bimatrix games when the learner's payoff matrix is unknown. It proves an impossibility result for fully general no-regret learners, then develops a payoff-matrix recovery framework based on facets and equivalence classes, combined with pessimistic strategies that guarantee sublinear Stackelberg regret provided the payoff estimate satisfies certain accuracy guarantees. For restricted learner classes (ascent algorithms, or stochastic mirror ascent with known regularizer and step sizes), two concrete algorithms, PAAL and PAMD, learn the learner's payoff structure in a sublinear number of rounds and achieve $o(T)$ Stackelberg regret. The study clarifies the information requirements for steering in strategic online environments and provides explore-then-commit strategies with provable performance guarantees.
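To make the steering target concrete, the following is a minimal sketch of how the optimizer's Stackelberg value could be approximated when the learner's payoff matrix is known. It is not the paper's method: the function names, the grid search, the optimistic tie-breaking rule, and the example payoff matrices are all illustrative assumptions, and the sketch only handles an optimizer with two actions.

```python
def best_responses(B, x, tol=1e-9):
    """Learner actions (columns of B) that maximize expected payoff against x."""
    n_cols = len(B[0])
    payoffs = [sum(x[i] * B[i][j] for i in range(len(x))) for j in range(n_cols)]
    top = max(payoffs)
    return [j for j, p in enumerate(payoffs) if p >= top - tol]


def stackelberg_value_2xn(A, B, grid=1000):
    """Approximate the optimizer's Stackelberg value by grid search.

    Assumes the optimizer has exactly two actions, so a mixed strategy is
    x = [p, 1 - p]. Ties among learner best responses are broken in the
    optimizer's favor (an assumed, commonly used convention).
    """
    best_val, best_x = float("-inf"), None
    for k in range(grid + 1):
        p = k / grid
        x = [p, 1.0 - p]
        val = max(
            sum(x[i] * A[i][j] for i in range(2))
            for j in best_responses(B, x)
        )
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x


# Hypothetical 2x2 game: rows are optimizer actions, columns are learner actions.
A = [[2, 4], [1, 3]]  # optimizer payoffs (illustrative numbers)
B = [[1, 0], [0, 1]]  # learner payoffs (illustrative numbers)
value, commitment = stackelberg_value_2xn(A, B)
print(value, commitment)
```

In this toy game, committing to either pure action earns the optimizer at most 3, while a suitable mixed commitment earns more. The setting studied here is harder because the matrix `B` is not available to the optimizer and has to be inferred from the learner's behavior.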

Abstract

We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two-player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit it. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach if the learner's algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.
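As an illustration of why a fixed commitment can steer a learner, the sketch below runs Hedge (multiplicative weights), one familiar no-regret algorithm, for the learner against an optimizer that plays the same mixed strategy every round. Hedge, the full-information feedback model, the step-size tuning, and the payoff numbers are assumptions chosen for illustration; they are not the learner classes or algorithms analyzed in the paper.

```python
import math


def hedge_vs_fixed_commitment(B, x, rounds=2000):
    """Simulate a Hedge learner against a fixed optimizer mixed strategy x.

    B[i][j] is the learner's payoff when the optimizer plays row i and the
    learner plays column j. The learner is assumed to observe the expected
    payoff of every action each round. Returns the learner's mixed strategy
    after the final update.
    """
    n = len(B[0])
    eta = math.sqrt(math.log(n) / rounds)  # standard horizon-dependent step size
    cumulative = [0.0] * n
    strategy = [1.0 / n] * n
    for _ in range(rounds):
        # Expected payoff of each learner action against the commitment x.
        payoffs = [sum(x[i] * B[i][j] for i in range(len(x))) for j in range(n)]
        cumulative = [c + u for c, u in zip(cumulative, payoffs)]
        # Exponential weights, shifted by the max for numerical stability.
        shift = max(cumulative)
        weights = [math.exp(eta * (c - shift)) for c in cumulative]
        total = sum(weights)
        strategy = [w / total for w in weights]
    return strategy


# Hypothetical learner payoffs and optimizer commitment (illustrative numbers).
B = [[1, 0], [0, 1]]
x = [0.4, 0.6]
final_strategy = hedge_vs_fixed_commitment(B, x)
print(final_strategy)
```

Against this commitment the learner's second action has the higher expected payoff, so the Hedge strategy concentrates on it. The optimizer in this sketch never needs `B` to observe that behavior, which is the kind of signal a payoff-recovery approach would have to rely on.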

Paper Structure

This paper contains 45 sections, 20 theorems, 167 equations, 6 figures, and 5 algorithms.

Key Result

Theorem 1

There exists a pair of game instances $G_1=(A,B_1)$ and $G_2=(A,B_2)$ with the same optimizer payoff matrix $A$, such that for all optimizer algorithms $\mathcal{A}_1$, there exists a no-regret algorithm $\mathcal{A}_2$ for the learner satisfying: $StackReg_1(\mathcal{A}_1,\mathcal{A}_2)=o(T)$ on $G

Figures (6)

  • Figure 1: Plot of facets $E_1,E_2,E_3$.
  • Figure 2: Comparison among $E_1(B),E_1(\hat{B})$ and $E_1^-(\hat{B},d)$.
  • Figure 3: Learning dynamics for optimizer algorithms OGD and BS for matching pennies.
  • Figure 4: Learning dynamics for optimizer algorithms OGD and BS for game instance 1.
  • Figure 5: Learning dynamics for optimizer algorithms OGD and BS for game instance 2.
  • ...and 1 more figure

Theorems & Definitions (48)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 1
  • Definition 6
  • Example 2
  • Definition 7
  • Proposition 1
  • ...and 38 more