ReLExS: Reinforcement Learning Explanations for Stackelberg No-Regret Learners

Xiangge Huang, Jingyuan Li, Jiaqing Xie

TL;DR

This work addresses learning Stackelberg equilibria in two-player Markov games under a no-regret constraint on the follower. It introduces reward-average and general no-regret notions for the follower and proves that, under these conditions, the leader can use reinforcement learning to reach the Stackelberg value, with bounded deviations captured by $|U(L,F) - U_S(L,F)| < \varepsilon T + o(T)$. Theoretical results (Theorems 2, 5–11) show adaptive followers can preserve best responses, and no-regret dynamics ensure convergence toward Stackelberg equilibria, including restricted variants in constant-sum settings. The approach is validated empirically on 12 iterated matrix games, demonstrating that no-regret follower dynamics largely match regret-based performance, with caveats in certain games and memory scenarios. Overall, ReLExS provides a principled framework for implementing Stackelberg-learning systems where followers operate under no-regret constraints, with implications for economics, security games, and multi-agent ML.

Abstract

With a no-regret follower, do the players in a two-player Stackelberg game still reach the Stackelberg equilibrium? We first show that when the follower's strategy is either reward-average or transform-reward-average, the two players can always reach the Stackelberg equilibrium. We then extend this result to show that the players can reach the Stackelberg equilibrium in the two-player game under the no-regret constraint. We also give a strict upper bound on the difference in the follower's utility with and without the no-regret constraint. Moreover, in constant-sum two-player Stackelberg games with no-regret action sequences, we show that the total optimal utility of the game also remains bounded.
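As a rough illustration of what a no-regret follower means, the sketch below runs Hedge (multiplicative weights) for the follower against a leader who samples from a fixed mixed strategy, and reports the follower's external regret. Hedge is a standard no-regret algorithm chosen here for illustration; it is not taken from the paper. The function name `hedge_follower_regret`, the payoff matrix, and the leader's strategy are all hypothetical.

```python
import math
import random


def hedge_follower_regret(payoff, leader_probs, T, seed=0):
    """Run Hedge for the follower over T rounds and return its external regret.

    payoff[a][b] is the follower's payoff, assumed to lie in [0, 1], when the
    leader plays action a and the follower plays action b (illustrative only).
    """
    rng = random.Random(seed)
    n_actions = len(payoff[0])
    eta = math.sqrt(8 * math.log(n_actions) / T)  # standard Hedge step size
    cum = [0.0] * n_actions  # cumulative payoff of each fixed follower action
    earned = 0.0             # expected payoff collected by the Hedge follower
    for _ in range(T):
        # Weights proportional to exp(eta * cumulative payoff); subtracting
        # the maximum keeps the exponentials numerically stable.
        m = max(cum)
        w = [math.exp(eta * (c - m)) for c in cum]
        total = sum(w)
        probs = [x / total for x in w]
        # The leader samples an action from its fixed mixed strategy.
        a = rng.choices(range(len(leader_probs)), weights=leader_probs)[0]
        row = payoff[a]
        earned += sum(p * r for p, r in zip(probs, row))
        cum = [c + r for c, r in zip(cum, row)]
    # Regret against the best fixed follower action in hindsight.
    return max(cum) - earned


# Hypothetical follower payoffs (a Prisoner's-Dilemma-like game scaled to [0, 1]).
follower_payoff = [[0.6, 1.0], [0.0, 0.2]]
T = 10_000
regret = hedge_follower_regret(follower_payoff, [0.5, 0.5], T)
print(f"regret = {regret:.2f}, average regret = {regret / T:.5f}")
```

With this step size, Hedge's regret is at most $\sqrt{T \ln N / 2}$ for $N$ follower actions, so the average regret vanishes as $T$ grows. That sublinear growth is what the no-regret constraint refers to.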

Paper Structure

This paper contains 39 sections, 23 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Mean episode reward of PPO+Meta+No Regret RL on 12 canonical symmetric iterated matrix games followed by oracles and followers [gerstgrasser2023oracles]. Green: original regret setting. Blue: no regret every 100 epochs. Orange: no regret after 100 epochs.
  • Figure 2: Memory to Leaders. Env: Prisoner's Dilemma
  • Figure 3: Original Regret Hidden vs. No Regret Hidden