Performative Reinforcement Learning with Linear Markov Decision Process
Debmalya Mandal, Goran Radanovic
TL;DR
The paper advances performative reinforcement learning to linear MDPs where deployed policies alter both rewards and dynamics. It develops a regularized, dual-aware optimization framework that yields last-iterate convergence to a performatively stable policy, aided by a new recurrence based on time-varying dual solutions. In the finite-sample regime, it introduces a reparameterization and an empirical Lagrangian solved via a saddle-point method, achieving polynomial-in-$D$ sample complexity under a bounded-coverage assumption. The approach generalizes to multi-agent settings, including stochastic Stackelberg games, and provides principled pathways for dimension-efficient learning in reactive environments. Overall, the work bridges performative RL theory with scalable linear-function approximation and practical saddle-point algorithms.
Abstract
We study the setting of \emph{performative reinforcement learning}, where the deployed policy affects both the reward and the transition dynamics of the underlying Markov decision process. Prior work~\parencite{MTR23} addressed this problem in the tabular setting and established last-iterate convergence of repeated retraining, with an iteration complexity that depends explicitly on the number of states. In this work, we generalize these results to \emph{linear Markov decision processes}, the primary theoretical model of large-scale MDPs. The main challenges with linear MDPs are that the regularized objective is no longer strongly convex, and that we want a bound that scales with the dimension of the features rather than with the number of states, which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a \emph{performatively stable policy}. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions to prove convergence. We then tackle the finite-sample setting, where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem and construct an empirical Lagrangian to be optimized from the samples. We show that, under a \emph{bounded coverage} condition, repeatedly solving for a saddle point of this empirical Lagrangian converges to a performatively stable solution, and we also construct a primal-dual algorithm that solves the empirical Lagrangian efficiently. Finally, we show several applications of the general framework of performative RL, including multi-agent systems.
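To make the idea of repeated retraining toward a performatively stable point concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes a degenerate single-state problem in which the learner picks a distribution `d` over actions, the reward reacts linearly to the deployed `d` through a hypothetical response `r0 + eps * A @ d`, and each retraining step maximizes an L2-regularized objective. The names `reward_response`, `retrain`, `repeated_retraining`, `r0`, `A`, `eps`, and `lam` are all illustrative assumptions. Under these assumptions the retraining map is a contraction whenever `eps * ||A||_2 / lam < 1`, so the iterates approach a fixed point.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def reward_response(d, r0, A, eps):
    # Hypothetical environment response (an assumption of this sketch):
    # the reward vector shifts linearly with the deployed distribution d.
    return r0 + eps * (A @ d)

def retrain(d, r0, A, eps, lam):
    # One retraining step: maximize <r, x> - (lam / 2) * ||x||^2 over the
    # simplex, where r is the reward induced by the currently deployed d.
    # The maximizer is the projection of r / lam onto the simplex.
    r = reward_response(d, r0, A, eps)
    return project_simplex(r / lam)

def repeated_retraining(r0, A, eps, lam, iters=200, tol=1e-10):
    # Start from the uniform distribution and redeploy the retrained
    # solution until successive iterates stop moving.
    d = np.full(len(r0), 1.0 / len(r0))
    for t in range(iters):
        d_next = retrain(d, r0, A, eps, lam)
        if np.linalg.norm(d_next - d) < tol:
            return d_next, t + 1
        d = d_next
    return d, iters

# Toy instance: rewards fall as an action is used more (A = -I).
r0 = np.array([1.0, 0.5, 0.2])
A = -np.eye(3)
d_stable, n_iters = repeated_retraining(r0, A, eps=0.2, lam=1.0)
print(d_stable, n_iters)
```

In this toy, a stable point is one where retraining against the reward induced by `d_stable` returns `d_stable` itself. The setting studied in the paper is considerably harder: the decision variable is constrained by dynamics that also depend on the deployed policy, and the regularized objective is not strongly convex.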
