On Bellman equations for continuous-time policy evaluation I: discretization and approximation

Wenlong Mou; Yuhua Zhu

TL;DR

The paper introduces high-order, model-free discretization schemes to estimate the continuous-time diffusion value function $f^*$ from discrete trajectories, by combining high-order Bellman operators $\mathcal{T}^{(n)}$ and high-order generators $\mathcal{A}^{(n)}$ with function-approximation projections. By exploiting the elliptic structure of the underlying diffusion, the authors derive uniformly bounded approximation factors and high-order error bounds in both $\mathbb{L}^\infty$ and $\mathbb{H}^1$ norms, under suitable smoothness and ellipticity assumptions. They also provide data-driven implementations via empirical estimates over trajectories and extend guarantees to discounted occupancy measures, supported by numerical simulations that demonstrate the practical gains of second-order and higher schemes over naive discretizations. Overall, the work offers a principled, high-accuracy framework for continuous-time policy evaluation that integrates seamlessly with model-free RL using function approximation. This advances the ability to learn value functions for continuous-time systems from discrete-time data with provable error control and practical algorithms.
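To make the idea of a high-order generator concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration and is not taken from the paper: the deterministic dynamics $dx/dt = -\lambda x$, the reward $r(x) = x$, the linear function class $f(x) = \theta x$, and the specific first- and second-order finite-difference stencils. The exact value function for this toy problem is $f^*(x) = x / (\beta + \lambda)$, so the discretization error of each stencil can be measured directly.

```python
import math

# Toy problem (illustrative assumption, not from the paper):
# dynamics dx/dt = -lam * x, reward r(x) = x, discount rate beta.
# Exact value function: f*(x) = x / (beta + lam).
lam, beta = 1.0, 0.5
exact_slope = 1.0 / (beta + lam)

def flow(x, t):
    """Exact state reached after time t when starting from x."""
    return x * math.exp(-lam * t)

def generator_fd(f, x, eta, order):
    """Finite-difference estimate of (A f)(x) = d/dt f(x_t) at t = 0,
    using only states observed on a grid of spacing eta."""
    if order == 1:
        # forward difference, error O(eta)
        return (f(flow(x, eta)) - f(x)) / eta
    if order == 2:
        # three-point one-sided stencil, error O(eta^2)
        return (-3.0 * f(x) + 4.0 * f(flow(x, eta)) - f(flow(x, 2 * eta))) / (2.0 * eta)
    raise ValueError("only orders 1 and 2 are sketched here")

def fitted_slope(eta, order):
    """Solve beta * f - A_eta f = r at x = 1 for f(x) = theta * x.
    The discretized generator is linear in theta, so this is one division."""
    a = generator_fd(lambda y: y, 1.0, eta, order)
    return 1.0 / (beta - a)

def error(eta, order):
    """Absolute error of the fitted slope against the exact slope."""
    return abs(fitted_slope(eta, order) - exact_slope)

if __name__ == "__main__":
    for order in (1, 2):
        ratio = error(0.1, order) / error(0.05, order)
        print(f"order {order}: error ratio when halving eta = {ratio:.2f}")
```

Halving the step size should roughly halve the error of the first-order stencil and roughly quarter the error of the second-order one, which is the qualitative behavior a high-order scheme is meant to deliver.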

Abstract

We study the problem of computing the value function from a discretely-observed trajectory of a continuous-time diffusion process. We develop a new class of algorithms based on easily implementable numerical schemes that are compatible with discrete-time reinforcement learning (RL) with function approximation. We establish high-order numerical accuracy as well as the approximation error guarantees for the proposed approach. In contrast to discrete-time RL problems where the approximation factor depends on the effective horizon, we obtain a bounded approximation factor using the underlying elliptic structures, even if the effective horizon diverges to infinity.
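The contrast between a naive and a higher-order discretized Bellman backup can be sketched on a toy problem. The setup below is an assumption made for illustration, not the paper's construction: deterministic dynamics $dx/dt = -\lambda x$, reward $r(x) = x$, linear value functions $f(x) = \theta x$, a left-endpoint rule for the first-order backup, and a trapezoid rule for the second-order backup. The exact value function is $f^*(x) = x / (\beta + \lambda)$.

```python
import math

# Toy problem (illustrative assumption, not from the paper):
# dynamics dx/dt = -lam * x, reward r(x) = x, discount rate beta.
lam, beta = 1.0, 0.5
exact_slope = 1.0 / (beta + lam)  # f*(x) = x / (beta + lam)

def bellman_slope_update(theta, eta, order):
    """Apply one discretized Bellman backup to f(x) = theta * x at x = 1
    and return the slope of the resulting linear function."""
    decay = math.exp(-beta * eta)   # discount accumulated over one step
    shrink = math.exp(-lam * eta)   # x_eta = shrink * x under the toy dynamics
    if order == 1:
        # left-endpoint rule for the reward integral: eta * r(x)
        reward_term = eta
    elif order == 2:
        # trapezoid rule: (eta / 2) * (r(x) + e^{-beta * eta} * r(x_eta))
        reward_term = 0.5 * eta * (1.0 + decay * shrink)
    else:
        raise ValueError("only orders 1 and 2 are sketched here")
    return reward_term + decay * shrink * theta

def solve_fixed_point(eta, order, iters=2000):
    """Iterate the backup to its fixed point, as in discrete-time value iteration.
    The backup is a contraction with factor exp(-(beta + lam) * eta)."""
    theta = 0.0
    for _ in range(iters):
        theta = bellman_slope_update(theta, eta, order)
    return theta

def discretization_error(eta, order):
    """Absolute error of the fixed-point slope against the exact slope."""
    return abs(solve_fixed_point(eta, order) - exact_slope)

if __name__ == "__main__":
    for order in (1, 2):
        ratio = discretization_error(0.1, order) / discretization_error(0.05, order)
        print(f"order {order}: error ratio when halving eta = {ratio:.2f}")
```

Both backups use only states on the observation grid, so either can be dropped into a standard discrete-time fixed-point iteration; only the treatment of the reward integral differs.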

Paper Structure (45 sections, 21 theorems, 233 equations, 7 figures)

Introduction
Notation.
Related work
Policy evaluation and (projected) fixed-points:
Computational methods in differential equations and control:
Machine learning for control and differential equations:
Problem setup
Time discretization for the value function
High-order approximation to the Bellman operator
High-order approximation to the diffusion generator
Practical algorithms via function approximation
General worst-case approximation guarantees
Improved approximation guarantees under ellipticity
Approximation guarantees for higher-order generator
Solving the projected-fixed points using empirical data
...and 30 more sections

Key Result

Theorem 1

Suppose the high-order smoothness assumption (assume:smooth-high-order) holds for some integer $n > 0$. Then the theorem's bound holds with a constant ${C}_n$ that depends on $\{L_i^{b}\}_{i = 0}^{2 n - 2}$, $\{L_i^{\Lambda}\}_{i = 0}^{2 n - 2}$, $\{L_i^{r}\}_{i = 0}^{2 n}$, and the problem dimension $d$.

Figures (7)

Figure 1: Error of the solution as the step size $\eta$ decreases. Left: The dynamics follow \ref{deter-dyna}, and the reward is \ref{deter-reward-1} with $\lambda = 0.05, k = 1, \beta = 0.1$ (above), and $\lambda = 0.01, k = 2, \beta = 2$ (below). Middle: The dynamics follow \ref{deter-dyna}, and the reward is \ref{deter-reward-2} with $\lambda = 0.01$, and $\alpha = 5, b = 1, \beta = 0.1$ (above), and $\alpha = 2, b = 2, \beta = 2$ (below). Right: The dynamics follow \ref{stoch-dy}, and the reward is \ref{stoch-reward} with $\sigma = 0.1, \beta = 0.1$ (above), and $\sigma = 1, \beta = 1$ (below).
Figure 2: Value functions for $\eta = 1$, with the same setting as the second row of Figure \ref{fig:exact_dt}.
Figure 3: Error of the approximated solution as the amount of data increases. The specific parameter choices are marked in the sub-figure titles. The dynamics in panels (a)(c) follow Eq. \ref{deter-dyna}, and the reward is Eq. \ref{deter-reward-1}. The dynamics in panels (b)(d) follow Eq. \ref{deter-dyna} with $\lambda = 0.01$, and the reward is Eq. \ref{deter-reward-2}.
Figure 4: Plots of the mean-squared error ${\mathbb{E}} [ \|\widehat{f} - f^*\|_{\xi}^2 ]$ versus trajectory length $T$. Each curve corresponds to a different algorithm. Each marker corresponds to a Monte Carlo estimate based on the empirical average of $50$ independent runs. As indicated by the sub-figure titles, each panel corresponds to a fixed stepsize $\eta$. Both axes in the plots are given by logarithmic scales.
Figure 5: Plots of the mean-squared error ${\mathbb{E}} [ \|\widehat{f} - f^*\|_{\xi}^2 ]$ versus stepsize $\eta$. Each curve corresponds to a different algorithm. Each marker corresponds to a Monte Carlo estimate based on the empirical average of $50$ independent runs. As indicated by the sub-figure titles, each panel corresponds to a fixed total time $T$. Both axes in the plots are given by logarithmic scales.
...and 2 more figures

Theorems & Definitions (26)

Example 1: Linear quadratic systems
Example 2: Langevin diffusion
Theorem 1
Proposition 1
Corollary 1
Proposition 2
Theorem 2
Theorem 3
Corollary 2
Proposition 3
...and 16 more
