PhiBE: A PDE-based Bellman Equation for Continuous Time Policy Evaluation
Yuhua Zhu
TL;DR
PhiBE introduces a PDE-based Bellman equation that embeds discrete-time transition information into a continuous-time policy evaluation framework, addressing the limitations of the conventional BE in SDE-driven environments. By deriving deterministic and stochastic forms and extending to higher orders, PhiBE achieves smaller discretization errors and remains well-conditioned as the sampling interval shrinks. A model-free Galerkin algorithm is developed to solve PhiBE from trajectory data, with convergence guarantees under model misspecification and strong theoretical results for both deterministic and stochastic dynamics. Empirical results across deterministic, stochastic, and stabilization tasks corroborate the theory, showing improved accuracy and data efficiency over BE and LSTD. The work lays groundwork for robust continuous-time RL, highlighting PhiBE’s resilience to reward oscillations and slow dynamics, and offering a practical approach for real-world continuous-time systems.
Abstract
In this paper, we study policy evaluation in continuous-time reinforcement learning, where the state follows an unknown stochastic differential equation (SDE) but only discrete-time data are available. We first highlight that the discrete-time Bellman equation (BE) is not always a reliable approximation to the true value function because it ignores the underlying continuous-time structure. We then introduce a new Bellman equation, PhiBE, which integrates the discrete-time information into a continuous-time PDE formulation. By leveraging the smooth SDE structure of the underlying dynamics, PhiBE provides a provably more accurate approximation to the true value function, especially in scenarios where the underlying dynamics change slowly or the reward oscillates. Moreover, we extend PhiBE to higher orders, providing increasingly accurate approximations. We further develop a model-free algorithm for PhiBE under linear function approximation and establish its convergence under model misspecification. In contrast to existing RL analyses that diverge as the sampling interval shrinks, the approximation error of PhiBE remains well-conditioned and independent of the discretization step by exploiting the smoothness of the underlying dynamics. Numerical experiments are provided to validate the theoretical guarantees we propose.
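To make the comparison between BE and PhiBE concrete, the following is a minimal sketch on a toy problem that is not taken from the paper. It assumes deterministic linear dynamics ds/dt = lam * s, reward r(s) = s, and discount rate beta > lam, so the true value function is V(s) = s / (beta - lam). It further assumes the discrete-time BE takes the form V(s) = r(s) * dt + exp(-beta * dt) * V(s_dt), and that the first-order deterministic PhiBE takes the form beta * V(s) = r(s) + ((s_dt - s) / dt) * V'(s); both forms, the linear ansatz V(s) = c * s, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math

# Toy setting (an assumption for illustration, not from the paper):
#   dynamics ds/dt = lam * s, reward r(s) = s, discount rate beta > lam.
# With the ansatz V(s) = c * s, each equation reduces to a scalar equation in c.

def true_coeff(beta, lam):
    # Continuous-time value: integral of exp(-beta*t) * s * exp(lam*t) over t >= 0.
    return 1.0 / (beta - lam)

def be_coeff(beta, lam, dt):
    # Assumed discrete-time BE: V(s) = r(s)*dt + exp(-beta*dt) * V(s * exp(lam*dt)).
    # Substituting V(s) = c*s gives c = dt / (1 - exp((lam - beta) * dt)).
    return dt / (1.0 - math.exp((lam - beta) * dt))

def phibe_coeff(beta, lam, dt):
    # Assumed first-order PhiBE: beta*V(s) = r(s) + ((s_dt - s)/dt) * V'(s),
    # where the drift is replaced by a finite difference of the observed transition.
    drift_hat = (math.exp(lam * dt) - 1.0) / dt  # per unit s
    return 1.0 / (beta - drift_hat)

if __name__ == "__main__":
    beta, lam = 1.0, 0.05  # slowly changing dynamics relative to the discount rate
    c_true = true_coeff(beta, lam)
    for dt in (0.5, 0.1, 0.01):
        err_be = abs(be_coeff(beta, lam, dt) - c_true)
        err_phibe = abs(phibe_coeff(beta, lam, dt) - c_true)
        print(f"dt={dt:<5} BE error={err_be:.2e}  PhiBE error={err_phibe:.2e}")
```

In this toy case the PhiBE error is governed by how well the finite difference approximates the drift, which is small when lam is small, whereas the BE error also carries the quadrature error of replacing the discounted reward integral with r(s) * dt. This is only meant to illustrate the slow-dynamics regime mentioned in the abstract, not to reproduce the paper's error bounds.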
