Performative Reinforcement Learning with Linear Markov Decision Process

Debmalya Mandal, Goran Radanovic

TL;DR

The paper extends performative reinforcement learning to linear MDPs, where deployed policies alter both rewards and dynamics. It shows that repeatedly optimizing a regularized objective yields last-iterate convergence to a performatively stable policy, using a new recurrence built from a linear combination of optimal dual solutions. In the finite-sample regime, it introduces a reparameterization and an empirical Lagrangian solved via a saddle-point method, achieving polynomial-in-$D$ sample complexity under a bounded-coverage assumption. The approach extends to multi-agent settings, including stochastic Stackelberg games. Overall, the work connects performative RL theory with linear function approximation and primal-dual saddle-point algorithms.

Abstract

We study the setting of performative reinforcement learning, where the deployed policy affects both the reward and the transition of the underlying Markov decision process. Prior work (MTR23) addressed this problem in the tabular setting and established last-iterate convergence of repeated retraining, with iteration complexity depending explicitly on the number of states. In this work, we generalize the results to linear Markov decision processes, which are the primary theoretical model of large-scale MDPs. The main challenge with linear MDPs is that the regularized objective is no longer strongly convex, and we want a bound that scales with the dimension of the features rather than the number of states, which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a performatively stable policy. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions to prove convergence. We then tackle the finite-sample setting, where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem and construct an empirical Lagrangian to be optimized from the samples. We show that, under a bounded coverage condition, repeatedly solving for a saddle point of this empirical Lagrangian converges to a performatively stable solution, and we also construct a primal-dual algorithm that solves the empirical Lagrangian efficiently. Finally, we show several applications of the general framework of performative RL, including multi-agent systems.
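The repeated-retraining loop described in the abstract can be illustrated with a minimal sketch on a toy problem. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-state, two-action model, the response function `respond`, the constants `GAMMA`, `LAM`, and `EPS`, and the use of an entropy regularizer solved by soft value iteration in place of the paper's regularized objective. A tabular MDP is a linear MDP with one-hot features, so the sketch sits inside the setting, but it does not exercise the feature-dimension analysis or the finite-sample Lagrangian.

```python
import math

GAMMA = 0.8   # discount factor (assumed)
LAM = 1.0     # regularization strength; an entropy temperature in this sketch
EPS = 0.05    # how strongly the environment reacts to the deployed policy

# Hypothetical base model: 2 states, 2 actions.
R0 = [[1.0, 0.0], [0.0, 1.0]]                 # R0[s][a]
P0 = [[[0.9, 0.1], [0.2, 0.8]],               # P0[s][a][s_next]
      [[0.7, 0.3], [0.1, 0.9]]]


def respond(pi):
    """Hypothetical environment response M(pi): rewards and transitions shift with pi."""
    # Actions the policy favours become slightly less rewarding.
    r = [[R0[s][a] - EPS * pi[s][a] for a in range(2)] for s in range(2)]
    # Next-state distribution is mixed with pi[s]; valid because |S| == |A| == 2 here.
    p = [[[(1 - EPS) * P0[s][a][n] + EPS * pi[s][n] for n in range(2)]
          for a in range(2)] for s in range(2)]
    return r, p


def plan(r, p, sweeps=500):
    """Entropy-regularized planning on a fixed model via soft value iteration."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        q = [[r[s][a] + GAMMA * sum(p[s][a][n] * v[n] for n in range(2))
              for a in range(2)] for s in range(2)]
        v = [LAM * math.log(sum(math.exp(q[s][a] / LAM) for a in range(2)))
             for s in range(2)]
    # Softmax policy with respect to the last Q-values; each row sums to 1.
    return [[math.exp((q[s][a] - v[s]) / LAM) for a in range(2)] for s in range(2)]


def repeated_retraining(rounds=200):
    """Deploy a policy, let the environment respond, re-optimize, and repeat."""
    pi = [[0.5, 0.5], [0.5, 0.5]]
    gap = float("inf")
    for _ in range(rounds):
        r, p = respond(pi)      # environment reacts to the deployed policy
        new_pi = plan(r, p)     # retrain against the induced model
        gap = max(abs(new_pi[s][a] - pi[s][a])
                  for s in range(2) for a in range(2))
        pi = new_pi
    return pi, gap


pi_stable, last_gap = repeated_retraining()
print(pi_stable, last_gap)
```

When `last_gap` is close to zero, the final policy is approximately a fixed point of the retraining map: it is (regularized-)optimal for the model that its own deployment induces, which is the sense of stability the abstract refers to. In this toy model, a small `EPS` relative to `LAM` keeps the retraining map contractive, loosely mirroring the role of the regularization parameter in the paper's convergence result.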

Paper Structure

This paper contains 23 sections, 24 theorems, 201 equations, 3 algorithms.

Key Result

Theorem 1

Suppose assumptions asn:measure-to-parameters, asn:lipschitzness, and asn:features hold, with $\alpha = \frac{\sqrt{\mathcal{M}}}{\sqrt{A}(1-\gamma)}$ and $\varepsilon_\mu < \frac{2 \sqrt{\kappa}}{25 \gamma \alpha^2}$. If alg:repeated-optimization is run with regularization parameter $\lambda > \frac{25 \cdots}{\cdots}$ ... where $r = \frac{5}{4} \sqrt{\frac{\varepsilon_\theta + \alpha \gamma \sqrt{D} \varepsilon_\mu}{\cdots}}$ ...

Theorems & Definitions (48)

  • Definition 1: Performatively Optimal Policy
  • Definition 2: Performatively Stable Policy
  • Theorem 1
  • Definition 3
  • Theorem 2
  • Theorem 3: Informal Statement
  • Theorem 4
  • Theorem 5
  • Corollary 1
  • Corollary 2
  • ...and 38 more
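
Definitions 1 and 2 are listed above by name only. As a hedged sketch, the standard forms of these notions in the performative RL literature can be written as follows. The notation is an assumption introduced here, not copied from this paper: $V^{\pi}_{\pi'}(\rho)$ denotes the value of policy $\pi$ from initial distribution $\rho$ in the MDP induced by deploying $\pi'$.

```latex
% Performatively optimal: best when evaluated in the environment it induces itself.
\pi_P \in \arg\max_{\pi} \; V^{\pi}_{\pi}(\rho)

% Performatively stable: a fixed point of retraining; optimal for the
% environment induced by its own deployment, holding that environment fixed.
\pi_S \in \arg\max_{\pi} \; V^{\pi}_{\pi_S}(\rho)
```

The two notions differ in what the maximization ranges over: for optimality, the environment changes together with the candidate policy, whereas for stability the environment is held fixed at the one induced by $\pi_S$. Repeated retraining targets the second notion.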