Performative Reinforcement Learning in Gradually Shifting Environments

Ben Rank, Stelios Triantafyllou, Debmalya Mandal, Goran Radanovic

TL;DR

MDRR is the first algorithm in this setting that combines samples from multiple deployments during training. This makes it particularly suitable for scenarios where the environment's response depends strongly on its previous dynamics, which are common in practice.

Abstract

When Reinforcement Learning (RL) agents are deployed in practice, they might impact their environment and change its dynamics. We propose a new framework to model this phenomenon, where the current environment depends on the deployed policy as well as its previous dynamics. This is a generalization of Performative RL (PRL) [Mandal et al., 2023]. Unlike PRL, our framework can model scenarios where the environment gradually adjusts to a deployed policy. We adapt two algorithms from the performative prediction literature to our setting and propose a novel algorithm called Mixed Delayed Repeated Retraining (MDRR). We provide conditions under which these algorithms converge and compare them using three metrics: number of retrainings, approximation guarantee, and number of samples per deployment. MDRR is the first algorithm in this setting that combines samples from multiple deployments in its training. This makes MDRR particularly suitable for scenarios where the environment's response strongly depends on its previous dynamics, which are common in practice. We experimentally compare the algorithms using a simulation-based testbed, and our results show that MDRR converges significantly faster than previous approaches.
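To make "gradually adjusts" concrete, here is a minimal sketch under a strong simplifying assumption: the new transition matrix is a convex mixture of the previous one and a fixed "response" matrix induced by the deployed policy. The mixing rule, the function names, and the use of `w` as the mixing weight are illustrative assumptions, not the paper's definition of the environment update.

```python
import numpy as np

def mix_dynamics(P_prev, P_response, w):
    """One gradual update (assumed form): move a fraction w toward the response."""
    return (1.0 - w) * P_prev + w * P_response

def simulate_fixed_policy(P0, P_response, w, steps):
    """Hold the policy (and hence the response) fixed and track the remaining gap."""
    P = P0.copy()
    gaps = []
    for _ in range(steps):
        P = mix_dynamics(P, P_response, w)
        gaps.append(np.abs(P - P_response).max())
    return P, gaps

# Toy 2-state example; both matrices are made-up, row-stochastic inputs.
P0 = np.array([[0.9, 0.1], [0.2, 0.8]])
P_response = np.array([[0.5, 0.5], [0.6, 0.4]])
P_final, gaps = simulate_fixed_policy(P0, P_response, w=0.5, steps=10)
```

Under this assumed rule the gap to the response shrinks by a factor of `1 - w` per deployment, so a single deployment only reveals part of the environment's eventual reaction. Setting `w = 1` recovers an environment that responds fully and immediately to the deployed policy.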

Paper Structure

This paper contains 45 sections, 24 theorems, 129 equations, 4 figures, 2 tables, and 4 algorithms.

Key Result

Theorem 1

Assume that Assumption \ref{assumption_sensitivity} holds and $\lambda={\mathcal{O}}\left(\frac{|S|^{5/2}}{(1-\epsilon)(1-\gamma)^4}\right)$. Then for any $\delta > 0$ we have $\left\lVert d_t - d_S\right\rVert_2 \leq \delta$ for all $t \geq \frac{\ln\left(\left( \frac{2}{1-\gamma} + \left(1+\sqrt{2}\r
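Thresholds of this logarithmic form typically arise from a geometric contraction. The following is a generic illustration only, not the paper's proof: the contraction factor $c$ and the initial distance $D = \left\lVert d_0 - d_S\right\rVert_2$ are hypothetical symbols that do not appear in the source.

$$\left\lVert d_{t+1} - d_S\right\rVert_2 \leq c \left\lVert d_t - d_S\right\rVert_2 \ \text{ with } 0 < c < 1 \quad\Longrightarrow\quad \left\lVert d_t - d_S\right\rVert_2 \leq c^{t} D \leq \delta \ \text{ whenever } \ t \geq \frac{\ln\left(D/\delta\right)}{\ln\left(1/c\right)}.$$

In such an argument, the number of retrainings needed grows only logarithmically in $1/\delta$.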

Figures (4)

  • Figure 1: The plots show the distance of the current occupancy measure from the average of the last 10 in that run (after 11990 deployments). The data represent means computed over 20 trials, along with their 95% confidence intervals. Unless otherwise noted, the settings are $k=3$ for DRR and MDRR, $v=1.1$ for MDRR, 1000 trajectories per iteration, $B=10$, $\lambda=0.1$, and $w=0.5$. Figures \ref{fig:samples1000-to-last-iteration} and \ref{fig:samples1000-to-last-iteration-w15} compare the three algorithms to one another, while Figure \ref{fig:ks-to-last-iteration} compares MDRR with different values for the hyperparameter $k$ and Figure \ref{fig:vs-to-last-iteration} compares MDRR with different values for the hyperparameter $v$.
  • Figure 2: The grid-world.
  • Figure 3: A sanity check of whether the algorithms reach valid solutions. Since the values of the three algorithms are close to one another, we conclude that none of them reaches a substantially worse solution than the others, which supports the validity of all three approaches.
  • Figure 4: Convergence plots for less stationary environments, i.e. larger values of $w$. Data generated as in Figure \ref{fig:main_plots}. Here too, MDRR outperforms the other algorithms.
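The convergence metric described in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes the distance is the Euclidean norm (the norm used in Theorem 1) and that the confidence intervals use a normal approximation; neither choice is confirmed by the captions, and the function names are hypothetical.

```python
import numpy as np

def distance_to_final_average(occupancies, last=10):
    """Distance of each occupancy measure to the mean of the run's last `last` ones."""
    d = np.asarray(occupancies, dtype=float)
    reference = d[-last:].mean(axis=0)
    flat = (d - reference).reshape(len(d), -1)
    return np.linalg.norm(flat, axis=1)  # assumed: L2 norm

def mean_and_ci95(curves):
    """Pointwise mean over trials and half-width of a normal-approximation 95% CI."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    sem = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, 1.96 * sem

# Synthetic run (made-up data): occupancies approach d_star geometrically.
d_star = np.array([0.25, 0.25, 0.5])
offset = np.array([0.1, -0.1, 0.0])
run = [d_star + 0.5 ** t * offset for t in range(30)]
dist = distance_to_final_average(run)
mean, half_width = mean_and_ci95([dist, 0.9 * dist])
```

Because the reference point is taken from the same run, the metric measures whether a run settles down rather than whether it reaches a particular target, which is why a separate check of solution quality (as in Figure 3) is useful.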

Theorems & Definitions (49)

  • Definition 1: Performatively Stable Policy
  • Theorem 1: informal, details in Appendix \ref{appdx.rr-exact}
  • Theorem 2: informal, details in Appendix \ref{appdx.rr-finite}
  • Definition 2
  • Theorem 3: informal, details in Appendix \ref{appdx.sec.drr-exact}
  • Theorem 4: informal, details in Appendix \ref{appdx.sec.drr-finite}
  • Theorem 5: informal, details in Appendix \ref{appdx.sec.mdrr-theorem-sec}
  • Theorem 6
  • Proof of Theorem \ref{thm.practical-sampling-mdrr}
  • Proposition 1
  • ...and 39 more