Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning

Yanwei Jia, Du Ouyang, Yufei Zhang

TL;DR

This work introduces and rigorously analyzes a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. It quantifies the convergence rate in terms of the regularity of the coefficients and establishes an optimal first-order convergence rate for sufficiently regular coefficients.

Abstract

Stochastic policies (also known as relaxed controls) are widely used in continuous-time reinforcement learning algorithms. However, executing a stochastic policy and evaluating its performance in a continuous-time environment remain open challenges. This work introduces and rigorously analyzes a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the stochastic policy. We explicitly quantify the convergence rate based on the regularity of the coefficients and establish an optimal first-order convergence rate for sufficiently regular coefficients. Additionally, we prove a $1/2$-order weak convergence rate that holds uniformly over the sampling noise with high probability, and establish a $1/2$-order pathwise convergence for each realization of the system noise in the absence of volatility control. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations. Our results provide theoretical justification for the exploratory stochastic control framework in [H. Wang, T. Zariphopoulou, and X.Y. Zhou, J. Mach. Learn. Res., 21 (2020), pp. 1-34].
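The execution scheme described in the abstract can be illustrated with a minimal sketch: an action is drawn from the policy at each grid time and held constant until the next one. The function name `simulate_sampled_policy`, its arguments, the scalar state, and the inner Euler-Maruyama substeps are all assumptions made for this illustration; the paper analyzes the exact piecewise-constant-control dynamics, not this particular numerical scheme.

```python
import numpy as np


def simulate_sampled_policy(b, sigma, sample_action, x0, T, n_grid, n_sub,
                            n_paths, rng):
    """Simulate terminal states X_T under a discretely sampled policy.

    Hypothetical helper (not from the paper). The state is scalar.
    b(t, x, a), sigma(t, x, a): drift and volatility coefficients.
    sample_action(t, x, rng): draws one action per path from the policy.
    n_grid: number of sampling intervals on [0, T].
    n_sub: Euler-Maruyama substeps per interval (numerical assumption).
    """
    h = T / n_grid   # sampling mesh size
    dt = h / n_sub   # inner time step used only for the simulation
    x = np.full(n_paths, float(x0))
    t = 0.0
    for _ in range(n_grid):
        # Sample the action once at the grid point ...
        a = sample_action(t, x, rng)
        # ... and keep it constant until the next grid point.
        for _ in range(n_sub):
            dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
            x = x + b(t, x, a) * dt + sigma(t, x, a) * dw
            t += dt
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Drift-controlled example: dX = a dt + dW with a ~ N(0, 1).
    x_T = simulate_sampled_policy(
        b=lambda t, x, a: a,
        sigma=lambda t, x, a: np.ones_like(x),
        sample_action=lambda t, x, rng: rng.standard_normal(x.shape),
        x0=0.0, T=1.0, n_grid=8, n_sub=4, n_paths=10_000, rng=rng,
    )
    print("Monte Carlo estimate of E[X_T^4]:", np.mean(x_T ** 4))
```

Shrinking the mesh size `T / n_grid` is the limit in which the abstract states weak convergence to the dynamics with policy-aggregated coefficients.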

Paper Structure

This paper contains 29 sections, 16 theorems, 147 equations, 2 figures.

Key Result

Lemma 3.1

Suppose (H.assum:standing) holds. Let $x_0\in {\mathbb R}^d$, $(\Omega,\mathcal{F},{\mathbb{F}}, \mathbb P)$ be the probability space defined in eq:space, and ${\mathbb{F}}=(\mathcal{F}_t)_{t\ge 0}$ be the filtration such that $\mathcal{F}_t \coloneqq \sigma\{ (W_s)_{s\leq t}, (\xi_i)_{i=0}^\infty\

Figures (2)

  • Figure 1: Weak and strong convergence analysis for the uncontrolled volatility case $\mathrm{d} X_t = a_t\mathrm{d} t+ \mathrm{d} W_t$ with actions $a_t\sim \mathcal{N}(0,1)$. The corresponding aggregated dynamics is $\mathrm{d} \tilde{X}_t = \mathrm{d} W_t$. The test function is $f(x)=x^4$ and $T=1$. Left: Weak error versus the number of grid points $n$. Right: RMSE versus the number of grid points $n$. Both axes are on log scales.
  • Figure 2: Weak and strong convergence analysis for the controlled volatility case $\mathrm{d} X_t = a_t \mathrm{d} W_t$ with actions $a_t\sim \mathcal{N}(0,1)$. The corresponding aggregated dynamics is $\mathrm{d} \tilde{X}_t = \mathrm{d} W_t$. The test function is $f(x)=x^4$ and $T=1$. Left: Weak error versus the number of grid points $n$. Right: RMSE versus the number of grid points $n$. Both axes are on log scales.
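For the two figure settings, the weak error can be worked out in closed form, which gives a quick sanity check on the first-order rate. The sketch below assumes a uniform grid with mesh $1/n$, i.i.d. actions independent of $W$, and $X_0 = 0$; these are assumptions of this illustration, and the function names are hypothetical. In the drift case, $X_1 = \frac{1}{n}\sum_i a_i + W_1 \sim \mathcal{N}(0, 1 + 1/n)$, so $\mathbb{E}[X_1^4] = 3(1+1/n)^2$. In the volatility case, $X_1 = \sum_i a_i \Delta W_i$ is a sum of independent terms with second moment $1/n$ and fourth moment $9/n^2$, so $\mathbb{E}[X_1^4] = 3 + 6/n$. The aggregated dynamics give $\mathbb{E}[W_1^4] = 3$ in both cases.

```python
import math

# Fourth moment of the aggregated dynamics: E[W_1^4] = 3.
AGGREGATED_FOURTH_MOMENT = 3.0


def weak_error_drift_case(n):
    """|E[X_1^4] - E[W_1^4]| for dX = a dt + dW, a ~ N(0, 1), mesh 1/n."""
    return 3.0 * (1.0 + 1.0 / n) ** 2 - AGGREGATED_FOURTH_MOMENT


def weak_error_volatility_case(n):
    """|E[X_1^4] - E[W_1^4]| for dX = a dW, a ~ N(0, 1), mesh 1/n."""
    return (3.0 + 6.0 / n) - AGGREGATED_FOURTH_MOMENT


def loglog_slope(err, n1, n2):
    """Slope of log(err) against log(n) between two grid sizes."""
    return (math.log(err(n2)) - math.log(err(n1))) / (
        math.log(n2) - math.log(n1)
    )


if __name__ == "__main__":
    for n in (8, 16, 32, 64, 128):
        print(n, weak_error_drift_case(n), weak_error_volatility_case(n))
    print("drift slope:", loglog_slope(weak_error_drift_case, 64, 128))
    print("volatility slope:",
          loglog_slope(weak_error_volatility_case, 64, 128))
```

Both errors scale like $1/n$, so on log-log axes the slopes approach $-1$, in line with the first-order weak rate stated in the abstract.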

Theorems & Definitions (46)

  • Definition 3.1
  • Remark 3.1
  • Remark 3.2
  • Lemma 3.1
  • Remark 4.1
  • Theorem 4.1
  • proof
  • Example 4.1: General smooth coefficients
  • Example 4.2: Gaussian policies
  • Example 4.3: Uncontrolled volatility
  • ...and 36 more