On Bellman equations for continuous-time policy evaluation I: discretization and approximation

Wenlong Mou, Yuhua Zhu

TL;DR

The paper introduces high-order, model-free discretization schemes to estimate the continuous-time diffusion value function $f^*$ from discrete trajectories, by combining high-order Bellman operators $\mathcal{T}^{(n)}$ and high-order generators $\mathcal{A}^{(n)}$ with function-approximation projections. By exploiting the elliptic structure of the underlying diffusion, the authors derive uniformly bounded approximation factors and high-order error bounds in both $\mathbb{L}^\infty$ and $\mathbb{H}^1$ norms, under suitable smoothness and ellipticity assumptions. They also provide data-driven implementations via empirical estimates over trajectories and extend guarantees to discounted occupancy measures, supported by numerical simulations that compare second-order and higher schemes against naive discretizations. Overall, the work offers a framework for continuous-time policy evaluation from discrete-time data that is compatible with model-free RL with function approximation and comes with provable error control.
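To make the gap between a naive and a higher-order discretization concrete, the sketch below works through a toy problem that is an assumption of this example, not one of the paper's experiments: deterministic dynamics $dx/dt = -\lambda x$, reward $r(x) = x$, and discount rate $\beta$, for which $f^*(x) = x / (\beta + \lambda)$ when the value function is taken to be $\int_0^\infty e^{-\beta t} r(x_t)\,dt$. The second scheme replaces the reward integral with a plain trapezoid rule; it stands in for a second-order scheme and is not claimed to be the paper's operator $\mathcal{T}^{(2)}$.

```python
import math

# Hypothetical test problem (not taken from the paper): dx/dt = -lam * x,
# reward r(x) = x, discount rate beta, so that f*(x) = x / (beta + lam).
beta, lam = 0.1, 0.05
a = beta + lam
exact = 1.0 / a  # slope of the true value function


def naive_coeff(eta):
    """Slope c of f(x) = c*x solving f(x) = eta*r(x) + exp(-beta*eta)*f(x_eta)."""
    return eta / (1.0 - math.exp(-a * eta))


def trapezoid_coeff(eta):
    """Same fixed point, with the reward integral replaced by a trapezoid rule."""
    decay = math.exp(-a * eta)
    return 0.5 * eta * (1.0 + decay) / (1.0 - decay)


etas = [1.0, 0.5, 0.25]
naive_err = [abs(naive_coeff(e) - exact) for e in etas]
trap_err = [abs(trapezoid_coeff(e) - exact) for e in etas]

for e, en, et in zip(etas, naive_err, trap_err):
    print(f"eta={e:<5} naive error={en:.5f}  trapezoid error={et:.6f}")
```

In this toy case, halving $\eta$ roughly halves the naive error and roughly quarters the trapezoid error, which is the first-order versus second-order behavior the summary refers to.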

Abstract

We study the problem of computing the value function from a discretely-observed trajectory of a continuous-time diffusion process. We develop a new class of algorithms based on easily implementable numerical schemes that are compatible with discrete-time reinforcement learning (RL) with function approximation. We establish high-order numerical accuracy as well as the approximation error guarantees for the proposed approach. In contrast to discrete-time RL problems where the approximation factor depends on the effective horizon, we obtain a bounded approximation factor using the underlying elliptic structures, even if the effective horizon diverges to infinity.
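As a rough illustration of the problem setting, the sketch below runs a naive first-order baseline on a discretely observed trajectory: scalar least-squares temporal difference (LSTD) with the single feature $\phi(x) = x$. The Ornstein-Uhlenbeck dynamics, the reward $r(x) = x$, and all parameter values are assumptions chosen so that the true value function $f^*(x) = x / (\beta + \lambda)$ is known in closed form; this is not the paper's high-order algorithm.

```python
import math
import random

# Hypothetical setup (not from the paper): Ornstein-Uhlenbeck diffusion
# dX = -lam * X dt + sigma dW, reward r(x) = x, discount rate beta.
lam, sigma, beta = 1.0, 1.0, 1.0
eta, n_steps = 0.1, 200_000

rng = random.Random(0)
rho = math.exp(-lam * eta)                         # exact one-step mean decay
noise_sd = sigma * math.sqrt((1.0 - rho**2) / (2.0 * lam))
gamma = math.exp(-beta * eta)                      # discrete-time discount factor

# Scalar LSTD with phi(x) = x and per-step reward eta * r(x):
# c = sum(phi_t * eta * r_t) / sum(phi_t * (phi_t - gamma * phi_{t+1})).
x = rng.gauss(0.0, sigma / math.sqrt(2.0 * lam))   # start at stationarity
num = den = 0.0
for _ in range(n_steps):
    x_next = rho * x + noise_sd * rng.gauss(0.0, 1.0)
    num += x * eta * x
    den += x * (x - gamma * x_next)
    x = x_next

c_lstd = num / den                                            # estimated slope
c_naive_limit = eta / (1.0 - math.exp(-(beta + lam) * eta))   # infinite-data limit
c_true = 1.0 / (beta + lam)                                   # slope of f*
print(c_lstd, c_naive_limit, c_true)
```

With more data the estimate settles near `c_naive_limit` rather than `c_true`: the remaining gap is discretization bias of order $\eta$, which additional samples cannot remove and which higher-order schemes aim to shrink.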

Paper Structure (45 sections, 21 theorems, 233 equations, 7 figures)

Key Result

Theorem 1

If Assumption \ref{assume:smooth-high-order} holds true for some integer $n > 0$, we have for a constant ${C}_n$ depending on $\{L_i^{b}\}_{i = 0}^{2 n - 2}$, $\{L_i^{\Lambda}\}_{i = 0}^{2 n - 2}$, $\{L_i^{r}\}_{i = 0}^{2 n}$ and problem dimension $d$.

Figures (7)

  • Figure 1: The above figure plots the error of the solution as the step size $\eta$ decreases. Left: The dynamics follow Eq. \ref{deter-dyna}, and the reward is Eq. \ref{deter-reward-1} with $\lambda = 0.05, k = 1, \beta = 0.1$ (above), and $\lambda = 0.01, k = 2, \beta = 2$ (below). Middle: The dynamics follow Eq. \ref{deter-dyna}, and the reward is Eq. \ref{deter-reward-2} with $\lambda = 0.01$, and $\alpha = 5, b = 1, \beta = 0.1$ (above), and $\alpha = 2, b = 2, \beta = 2$ (below). Right: The dynamics follow Eq. \ref{stoch-dy}, and the reward is Eq. \ref{stoch-reward} with $\sigma = 0.1, \beta = 0.1$ (above), and $\sigma = 1, \beta = 1$ (below).
  • Figure 2: The above figure plots the value functions for $\eta = 1$ with the same setting as the second row of Figure \ref{fig:exact_dt}.
  • Figure 3: The above figure plots the error of the approximated solution as the amount of data increases. The specific parameter choices are marked in the sub-figure titles. The dynamics in panels (a) and (c) follow Eq. \ref{deter-dyna}, and the reward is Eq. \ref{deter-reward-1}. The dynamics in panels (b) and (d) follow Eq. \ref{deter-dyna} with $\lambda = 0.01$, and the reward is Eq. \ref{deter-reward-2}.
  • Figure 4: Plots of the mean-squared error ${\mathbb{E}} [ \|\widehat{f} - f^*\|_{\xi}^2 ]$ versus trajectory length $T$. Each curve corresponds to a different algorithm. Each marker corresponds to a Monte Carlo estimate based on the empirical average of $50$ independent runs. As indicated by the sub-figure titles, each panel corresponds to a fixed stepsize $\eta$. Both axes in the plots are given by logarithmic scales.
  • Figure 5: Plots of the mean-squared error ${\mathbb{E}} [ \|\widehat{f} - f^*\|_{\xi}^2 ]$ versus stepsize $\eta$. Each curve corresponds to a different algorithm. Each marker corresponds to a Monte Carlo estimate based on the empirical average of $50$ independent runs. As indicated by the sub-figure titles, each panel corresponds to a fixed total time $T$. Both axes in the plots are given by logarithmic scales.
  • ...and 2 more figures

Theorems & Definitions (26)

  • Example 1: Linear quadratic systems
  • Example 2: Langevin diffusion
  • Theorem 1
  • Proposition 1
  • Corollary 1
  • Proposition 2
  • Theorem 2
  • Theorem 3
  • Corollary 2
  • Proposition 3
  • ...and 16 more