Achieve Performatively Optimal Policy for Performative Reinforcement Learning

Ziyi Chen, Heng Huang

TL;DR

This work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm, which uses a zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, and obtains the first polynomial-time convergence to the desired performatively optimal (PO) policy under the standard regularizer dominance condition.

Abstract

Performative reinforcement learning is an emerging dynamic decision-making framework that extends reinforcement learning to the common applications where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning only aim at a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm, which uses a zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, and obtains the first polynomial-time convergence to the desired PO policy under the standard regularizer dominance condition. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS) of the value function is a desired PO policy. Second, though the value function has an unbounded gradient, we prove that all the sufficiently stationary points lie in a convex and compact policy subspace $\Pi_\Delta$, where the policy value has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous. Experimental results also demonstrate that our 0-FW algorithm is more effective than the existing algorithms in finding the desired PO policy.
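To make the idea of a zeroth-order Frank-Wolfe update concrete, the sketch below runs it on a toy single-state problem. It is not the paper's algorithm or setting: the objective `value`, its constants, and the names `zo_gradient`, `lmo`, and `zo_frank_wolfe` are all assumptions made for illustration. The policy is a probability vector whose entries are kept at or above a floor (playing the role of $\Delta$), the gradient is estimated only from function evaluations, and each step moves toward the vertex of the feasible set that maximizes the estimated linear approximation.

```python
import math
import random


def value(pi, r0=(1.0, 0.5, 0.2), c=0.3, tau=0.1):
    """Toy objective (an assumption, not the paper's value function).

    The reward of action i shrinks as the policy plays it more often,
    mimicking a policy-dependent environment; an entropy term is added.
    """
    reward = sum(p * (r - c * p) for p, r in zip(pi, r0))
    entropy = -sum(p * math.log(p) for p in pi)
    return reward + tau * entropy


def zo_gradient(f, pi, delta, num_samples, rng):
    """Two-point zeroth-order gradient estimate along the simplex.

    Directions u are unit vectors with zero sum, so pi +/- delta*u still
    sums to one; the factor (n - 1) is the dimension of that subspace.
    """
    n = len(pi)
    grad = [0.0] * n
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(u) / n
        u = [x - mean for x in u]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        plus = [p + delta * x for p, x in zip(pi, u)]
        minus = [p - delta * x for p, x in zip(pi, u)]
        scale = (n - 1) * (f(plus) - f(minus)) / (2.0 * delta)
        grad = [g + scale * x / num_samples for g, x in zip(grad, u)]
    return grad


def lmo(grad, floor):
    """Linear maximization over {pi : pi_i >= floor, sum(pi) = 1}.

    The maximizer is a vertex: `floor` everywhere except the coordinate
    with the largest estimated gradient, which receives the remaining mass.
    """
    n = len(grad)
    best = max(range(n), key=lambda i: grad[i])
    vertex = [floor] * n
    vertex[best] = 1.0 - (n - 1) * floor
    return vertex


def zo_frank_wolfe(f, n, floor=0.05, delta=0.01, steps=200,
                   num_samples=20, seed=0):
    """Frank-Wolfe ascent driven by the zeroth-order gradient estimate."""
    rng = random.Random(seed)
    pi = [1.0 / n] * n
    for t in range(steps):
        grad = zo_gradient(f, pi, delta, num_samples, rng)
        target = lmo(grad, floor)
        step = 2.0 / (t + 2.0)  # classic diminishing Frank-Wolfe step size
        pi = [(1.0 - step) * p + step * q for p, q in zip(pi, target)]
    return pi


if __name__ == "__main__":
    policy = zo_frank_wolfe(value, 3)
    print(policy, value(policy))
```

Because `delta` is chosen smaller than `floor`, every perturbed point stays strictly inside the simplex, so the logarithm in the entropy term is always defined. This mirrors, in a very simplified way, why restricting to a subspace with a positive lower bound keeps the gradient well behaved.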

Paper Structure

This paper contains 34 sections, 19 theorems, 154 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1

Under Assumptions assum:sensitive-assum:dmin, the entropy-regularized value function (eq:Vfunc) satisfies a gradient dominance property for any $\pi_0,\pi_1\in\Pi$.

Figures (2)

  • Figure 1: Experimental Results.
  • Figure : Zeroth-order Frank-Wolfe (0-FW) Algorithm

Theorems & Definitions (31)

  • Definition 1: Ultimate Goal: PO
  • Definition 2: Stationary Policy
  • Theorem 1: Gradient Dominance
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Proposition 1
  • Lemma 1
  • Theorem 4
  • Proposition 2
  • ...and 21 more