ReLExS: Reinforcement Learning Explanations for Stackelberg No-Regret Learners

Xiangge Huang; Jingyuan Li; Jiaqing Xie

TL;DR

This work addresses learning Stackelberg equilibria in two-player Markov games under a no-regret constraint on the follower. It introduces reward-average and general no-regret notions for the follower and proves that, under these conditions, the leader can use reinforcement learning to reach the Stackelberg value, with bounded deviations captured by $|U(L,F) - U_S(L,F)| < \varepsilon T + o(T)$. Theoretical results (Theorems 2, 5–11) show adaptive followers can preserve best responses, and no-regret dynamics ensure convergence toward Stackelberg equilibria, including restricted variants in constant-sum settings. The approach is validated empirically on 12 iterated matrix games, demonstrating that no-regret follower dynamics largely match regret-based performance, with caveats in certain games and memory scenarios. Overall, ReLExS provides a principled framework for implementing Stackelberg-learning systems where followers operate under no-regret constraints, with implications for economics, security games, and multi-agent ML.
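To make the "Stackelberg value" concrete, here is a minimal sketch for a one-shot bimatrix game in which the leader commits to a pure action and the follower best-responds. The payoff matrices (a Prisoner's Dilemma), the function name `pure_stackelberg`, and the leader-favoring tie-break are illustrative assumptions, not taken from the paper, which works with Markov games and learned policies.

```python
# Hypothetical Prisoner's Dilemma payoffs, indexed [leader_action][follower_action];
# action 0 = cooperate, action 1 = defect. These numbers are assumptions for illustration.
LEADER_PAYOFF = [[3.0, 0.0], [5.0, 1.0]]
FOLLOWER_PAYOFF = [[3.0, 5.0], [0.0, 1.0]]


def pure_stackelberg(leader_payoff, follower_payoff):
    """Return (leader_value, leader_action, follower_action) under pure commitment."""
    best = None
    for l, follower_row in enumerate(follower_payoff):
        top = max(follower_row)
        # Follower best responses to leader action l; ties are broken in the
        # leader's favor (the usual "strong" Stackelberg convention).
        responses = [f for f, v in enumerate(follower_row) if v == top]
        f = max(responses, key=lambda j: leader_payoff[l][j])
        candidate = (leader_payoff[l][f], l, f)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


value, l_star, f_star = pure_stackelberg(LEADER_PAYOFF, FOLLOWER_PAYOFF)
```

In this toy game the follower defects against either commitment, so the leader's best commitment is also to defect. The deviation bound in the TL;DR compares the utility actually accumulated over $T$ rounds against a benchmark of this kind.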

Abstract

If the follower is constrained to be a no-regret learner, do the players in a two-player Stackelberg game still reach a Stackelberg equilibrium? We first show that when the follower's strategy is either reward-average or transform-reward-average, the two players can always reach the Stackelberg equilibrium. We then extend this result to show that the players can reach the Stackelberg equilibrium in the two-player game under the no-regret constraint. We also give a strict upper bound on the difference in the follower's utility with and without the no-regret constraint. Finally, in constant-sum two-player Stackelberg games with no-regret action sequences, we show that the total optimal utility of the game also remains bounded.
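As an illustration of what a no-regret follower looks like in practice, the sketch below runs the Hedge (multiplicative weights) algorithm for a follower facing a leader who commits to a fixed action. Hedge is a standard no-regret algorithm swapped in here for illustration; it is not claimed to be the follower used in the paper. The payoff matrix, the learning-rate choice, and the name `hedge_follower` are likewise assumptions.

```python
import math
import random

# Hypothetical follower payoffs, indexed [leader_action][follower_action];
# action 0 = cooperate, action 1 = defect (assumed for illustration only).
FOLLOWER_PAYOFF = [[3.0, 5.0], [0.0, 1.0]]


def hedge_follower(leader_actions, payoff, eta, seed=0):
    """Play Hedge against a leader action sequence; return (actions, external_regret)."""
    rng = random.Random(seed)
    n_actions = len(payoff[0])
    cumulative = [0.0] * n_actions  # payoff each fixed follower action would have earned
    earned = 0.0                    # payoff the follower actually earned
    actions = []
    for l in leader_actions:
        # Sample proportionally to exp(eta * cumulative payoff); subtracting the
        # maximum keeps the exponentials numerically stable.
        m = max(cumulative)
        weights = [math.exp(eta * (c - m)) for c in cumulative]
        a = rng.choices(range(n_actions), weights=weights)[0]
        actions.append(a)
        earned += payoff[l][a]
        # Full-information update: the follower sees the payoff of every action.
        for b in range(n_actions):
            cumulative[b] += payoff[l][b]
    # External regret: best fixed action in hindsight minus realized payoff.
    return actions, max(cumulative) - earned


T = 2000
payoff_range = 5.0  # payoffs in the assumed matrix lie in [0, 5]
eta = math.sqrt(8 * math.log(2) / T) / payoff_range  # a common Hedge tuning
actions, regret = hedge_follower([0] * T, FOLLOWER_PAYOFF, eta)
avg_regret = regret / T
defect_rate = sum(actions) / T
```

Against a fixed leader action the follower's play concentrates on its best response, so the average regret `avg_regret` shrinks as `T` grows. That is the sense in which a no-regret follower can still behave like a best-responding one.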

Paper Structure (39 sections, 23 equations, 3 figures, 1 table)

This paper contains 39 sections, 23 equations, 3 figures, 1 table.

Introduction
Related Work
Learning Stackelberg Equilibria
Existing Implementations of Stackelberg Equilibrium
No regret learning
Model and Preliminaries
Formal Markov Game Model
Best Response.
Definition 1.
Theorem 2.
Proof.
no regret Learning and Characteristic
Definition 3.
Definition 4.
Theorem 5.
...and 24 more sections

Figures (3)

Figure 1: Mean episode reward of PPO+Meta+No Regret RL on 12 canonical symmetric iterated matrix games followed by oracles and followers [gerstgrasser2023oracles]. Green: original regret setting. Blue: no regret every 100 epochs. Orange: no regret after 100 epochs.
Figure 2: Memory to Leaders. Env: Prisoner's Dilemma
Figure 3: Original Regret Hidden vs. No Regret Hidden
