
PhiBE: A PDE-based Bellman Equation for Continuous Time Policy Evaluation

Yuhua Zhu

TL;DR

PhiBE introduces a PDE-based Bellman equation that embeds discrete-time transition information into a continuous-time policy evaluation framework, addressing the limitations of the conventional BE in SDE-driven environments. By deriving deterministic and stochastic forms and extending to higher orders, PhiBE achieves smaller discretization errors and remains well-conditioned as the sampling interval shrinks. A model-free Galerkin algorithm is developed to solve PhiBE from trajectory data, with convergence guarantees under model misspecification and strong theoretical results for both deterministic and stochastic dynamics. Empirical results across deterministic, stochastic, and stabilization tasks corroborate the theory, showing improved accuracy and data efficiency over BE and LSTD. The work lays groundwork for robust continuous-time RL, highlighting PhiBE’s resilience to reward oscillations and slow dynamics, and offering a practical approach for real-world continuous-time systems.

Abstract

In this paper, we study policy evaluation in continuous-time reinforcement learning, where the state follows an unknown stochastic differential equation (SDE) but only discrete-time data are available. We first highlight that the discrete-time Bellman equation (BE) is not always a reliable approximation to the true value function because it ignores the underlying continuous-time structure. We then introduce a new Bellman equation, PhiBE, which integrates the discrete-time information into a continuous-time PDE formulation. By leveraging the smooth SDE structure of the underlying dynamics, PhiBE provides a provably more accurate approximation to the true value function, especially in scenarios where the underlying dynamics change slowly or the reward oscillates. Moreover, we extend PhiBE to higher orders, providing increasingly accurate approximations. We further develop a model-free algorithm for PhiBE under linear function approximation and establish its convergence under model misspecification. In contrast to existing RL analyses, whose bounds diverge as the sampling interval shrinks, the approximation error of PhiBE remains well-conditioned and independent of the discretization step because it exploits the smoothness of the underlying dynamics. Numerical experiments are provided to validate the theoretical guarantees we propose.
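
The claim about slowly changing dynamics can be illustrated with a toy problem that has closed-form solutions. The sketch below is not the paper's experiment. It assumes deterministic dynamics $ds/dt = \lambda s$, a linear reward $r(s) = s$, a BE of the form $\tilde{V}(s) = r(s){\Delta t} + e^{-\beta {\Delta t}}\tilde{V}(s_{\Delta t})$, and a first-order deterministic PhiBE of the form $\beta \hat{V} = r + \hat{\mu}\cdot\nabla \hat{V}$ with $\hat{\mu}(s) = (s_{\Delta t} - s)/{\Delta t}$. None of these forms is stated in this excerpt, and the function names are hypothetical. Under these assumptions every solution is linear, $V(s) = c\,s$, so only the coefficient $c$ needs to be compared.

```python
import math


def true_coeff(beta, lam):
    # True value: V(s) = int_0^inf e^{-beta t} * s * e^{lam t} dt = s / (beta - lam).
    # Requires beta > lam for the integral to converge.
    return 1.0 / (beta - lam)


def be_coeff(beta, lam, dt):
    # Assumed BE: V(s) = r(s) * dt + e^{-beta dt} * V(s * e^{lam dt}).
    # Substituting V(s) = c * s gives c = dt / (1 - e^{(lam - beta) dt}).
    return dt / (1.0 - math.exp((lam - beta) * dt))


def phibe_coeff(beta, lam, dt):
    # Assumed first-order PhiBE: beta * V = r + mu_hat * V',
    # where mu_hat(s) = (s_dt - s) / dt = lam_hat * s is built from
    # the discrete-time transition only.
    lam_hat = (math.exp(lam * dt) - 1.0) / dt
    # Substituting V(s) = c * s gives c = 1 / (beta - lam_hat).
    # Requires beta > lam_hat.
    return 1.0 / (beta - lam_hat)


# Slow dynamics: lam is small relative to the discount coefficient beta.
beta, lam = 0.1, 0.01
c_true = true_coeff(beta, lam)
for dt in (5.0, 1.0, 0.1):
    err_be = abs(be_coeff(beta, lam, dt) - c_true)
    err_phibe = abs(phibe_coeff(beta, lam, dt) - c_true)
    print(f"dt={dt:4.1f}  BE error={err_be:.4f}  PhiBE error={err_phibe:.4f}")
```

In this toy setting the leading BE error is about ${\Delta t}/2$, while the leading PhiBE error is about $\lambda^2 {\Delta t} / (2(\beta-\lambda)^2)$, so PhiBE is the better approximation only when $\lambda$ is small relative to $\beta$. This matches the abstract's qualifier about slowly changing dynamics rather than suggesting a universal improvement.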

Paper Structure

This paper contains 49 sections, 15 theorems, 235 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Assume that $\left\lVert r\right\rVert_{L^\infty}$ and $\left\lVert \mathcal{L}_{\mu,\Sigma} r \right\rVert_{L^\infty}$ are bounded. Then the solution $\tilde{V}(s)$ to the BE (Definition 1) approximates the true value function $V(s)$ with a bounded error. The displayed bound and the definition of $\mathcal{L}_{\mu,\Sigma}$ are missing from this extract; in them, $\Sigma = \sigma \sigma^\top$ and $\Sigma : \nabla^2 = \sum_{i,j}\Sigma_{ij}\partial_{s_i}\partial_{s_j}$.
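
For orientation, the notation can be unpacked under one assumption: that $\mathcal{L}_{\mu,\Sigma}$ denotes the standard infinitesimal generator of an SDE $ds_t = \mu(s_t)\,dt + \sigma(s_t)\,dB_t$. The excerpt does not show the definition, and the symbols $\mu$ and $B_t$ are introduced here only for illustration. Under that assumption the operator would act on a smooth function $f$ as

$$
\mathcal{L}_{\mu,\Sigma} f = \mu \cdot \nabla f + \tfrac{1}{2}\, \Sigma : \nabla^2 f ,
\qquad
\text{and in one dimension} \qquad
\mathcal{L}_{\mu,\Sigma} f = \mu f' + \tfrac{1}{2}\sigma^2 f'' .
$$

As a worked case, a one-dimensional oscillating reward $r(s) = \cos(ks)$ would give $\mathcal{L}_{\mu,\Sigma} r = -k\mu \sin(ks) - \tfrac{1}{2}\sigma^2 k^2 \cos(ks)$, so $\left\lVert \mathcal{L}_{\mu,\Sigma} r \right\rVert_{L^\infty}$ grows with the frequency $k$. This is one way to read the abstract's remark that the BE is less reliable when the reward oscillates.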

Figures (8)

  • Figure 1: Here the data are collected every ${\Delta t}$ units of time, $\beta$ is the discount coefficient, and $V(s)$ is the true value function. In our setting, a larger discount coefficient indicates that future rewards are discounted more. LSTD (Bradtke and Barto, 1996) is a popular RL algorithm for linear function approximation. The PhiBE is proposed in Section \ref{sec:phibe} and the algorithm is proposed in Section \ref{sec:algo}.
  • Figure 2: The PhiBE solution and the BE solution, when the discrete-time transition dynamics are given, are plotted in solid lines. The approximated PhiBE solution based on Algorithm \ref{algo:galerkin_phibe_deter} and the approximated BE solution based on LSTD, when discrete-time data are given, are plotted in dashed lines. Both algorithms use the same data points.
  • Figure 3: The $L^2$ errors \ref{def of l2 error} of the PhiBE solutions and the BE solutions with decreasing ${\Delta t}$ are plotted in the left two figures. The $L^2$ errors \ref{def of l2 error} of the approximated PhiBE solutions and the approximated BE solutions with an increasing amount of data collected every ${\Delta t} = 5$ units of time are plotted in the right two figures. The solid lines are the average over $100$ simulations for each data size. We set $\lambda = 0.05, \beta = 0.1, k = 1$ in both the linear and nonlinear cases.
  • Figure 4: The PhiBE solution and the BE solution, when the discrete-time transition dynamics are given, are plotted in solid lines. The approximated PhiBE solution based on Algorithm \ref{algo:galerkin_phibe_stoch} and the approximated BE solution based on LSTD, when discrete-time data are given, are plotted in dashed lines. Both algorithms use the same data points.
  • Figure 5: The $L^2$ errors \ref{def of l2 error} of the PhiBE solutions and the BE solutions with decreasing ${\Delta t}$ are plotted in (a). The $L^2$ errors \ref{def of l2 error} of the approximated PhiBE solutions and the approximated BE solutions with an increasing amount of data collected every ${\Delta t} = 1$ unit of time are plotted in (b). The solid lines are the average over $100$ simulations. We set $\beta = 0.1, k = 1$ in both figures.
  • ...and 3 more figures

Theorems & Definitions (30)

  • Remark 1
  • Definition 1: Definition of BE
  • Theorem 3.1
  • Remark 2: Assumptions on $\left\lVert \mathcal{L}_{\mu,\Sigma} r \right\rVert_{L^\infty}$
  • Definition 2: i-th order PhiBE in deterministic dynamics
  • Remark 3
  • Theorem 3.2
  • Theorem 3.3
  • Definition 3: i-th order PhiBE in stochastic dynamics
  • Remark 4
  • ...and 20 more