Optimal-PhiBE: A PDE-based Model-free framework for Continuous-time Reinforcement Learning
Yuhua Zhu, Yuming Zhang, Haoyu Zhang
TL;DR
This work addresses continuous-time reinforcement learning (CTRL) with unknown SDE dynamics when only discrete-time data are available. It introduces Optimal-PhiBE, a PDE-based Bellman equation that embeds discrete-time information into the continuous-time HJB framework. The paper provides i-th order PhiBE formulations and sharp error bounds for LQR, and shows that in the undiscounted LQR case Optimal-PhiBE exactly recovers the optimal policy. Optimal-PhiBE also has smaller discretization error when the uncontrolled dynamics are slow and is less sensitive to oscillatory rewards. A model-free policy-iteration algorithm is proposed to solve Optimal-PhiBE directly from trajectory data via Galerkin or gradient-based methods, without estimating the dynamics explicitly. The approach is validated on LQR and Merton’s portfolio problems, where it shows reduced discretization error and outperforms Optimal-BE in many regimes, including weak discounting and control-dominant dynamics.
Abstract
This paper addresses continuous-time reinforcement learning (CTRL) where the system dynamics are governed by an unknown stochastic differential equation and only discrete-time observations are available. Existing approaches face limitations: model-based PDE methods suffer from non-identifiability, while model-free methods based on the discrete-time optimal Bellman equation (Optimal-BE) suffer from large discretization errors that are highly sensitive to both the system dynamics and the reward structure. To overcome these challenges, we introduce Optimal-PhiBE, a formulation that integrates discrete-time information into a continuous-time PDE, combining the strengths of both existing frameworks while mitigating their limitations. Optimal-PhiBE exhibits smaller discretization errors when the uncontrolled system evolves slowly, is less sensitive to oscillatory reward structures, and enables model-free algorithms that bypass explicit dynamics estimation. In the linear-quadratic regulator (LQR) setting, sharp error bounds are established for both Optimal-PhiBE and Optimal-BE. The results show that Optimal-PhiBE exactly recovers the optimal policy in the undiscounted case and substantially outperforms Optimal-BE when the problem is weakly discounted or control-dominant. Furthermore, we extend Optimal-PhiBE to higher orders, providing increasingly accurate approximations. A model-free policy iteration algorithm is proposed to solve Optimal-PhiBE directly from trajectory data. Numerical experiments are conducted to verify the theoretical findings.
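To make the discretization error of a discrete-time Bellman formulation concrete, the following is a minimal sketch for a scalar, deterministic, discounted LQR problem. It is an illustration under stated assumptions, not the paper's construction or experiments: the dynamics ds = (A s + B a) dt, the running cost Q s^2 + R a^2, the piecewise-constant action over each step, and the function names `continuous_gain`, `discrete_gain`, and `gain_error` are all chosen here for illustration. It does not implement Optimal-PhiBE; it only compares the exact continuous-time feedback gain with the gain obtained from a discrete-time Bellman equation with step dt.

```python
import math


def continuous_gain(A, B, Q, R, beta):
    """Gain K of the optimal feedback a = -K s for the continuous-time problem.

    Minimises the integral of exp(-beta t) (Q s^2 + R a^2) subject to
    ds = (A s + B a) dt. With value P s^2, the HJB equation reduces to
    (B^2 / R) P^2 + (beta - 2A) P - Q = 0; the positive root is taken.
    """
    c = beta - 2.0 * A
    P = R * (-c + math.sqrt(c * c + 4.0 * B * B * Q / R)) / (2.0 * B * B)
    return P * B / R


def discrete_gain(A, B, Q, R, beta, dt, tol=1e-12, max_iter=500_000):
    """Gain of the discrete-time Bellman solution with step dt.

    Assumption for this sketch: the action is held constant over each step,
    the stage cost is (Q s^2 + R a^2) dt, and the discount is exp(-beta dt).
    """
    Ad = math.exp(A * dt)                            # exact one-step state map
    Bd = B * dt if A == 0.0 else B * (Ad - 1.0) / A  # exact one-step input map
    gamma = math.exp(-beta * dt)
    q, r = Q * dt, R * dt
    P = 0.0
    for _ in range(max_iter):
        # Scalar discrete Riccati recursion (value iteration on P).
        P_next = q + gamma * Ad * Ad * P * r / (r + gamma * Bd * Bd * P)
        converged = abs(P_next - P) <= tol * max(1.0, abs(P_next))
        P = P_next
        if converged:
            break
    return gamma * Ad * Bd * P / (r + gamma * Bd * Bd * P)


def gain_error(dt, A=1.0, B=1.0, Q=1.0, R=1.0, beta=0.1):
    """Absolute gap between the discrete-time and continuous-time gains."""
    return abs(discrete_gain(A, B, Q, R, beta, dt)
               - continuous_gain(A, B, Q, R, beta))


if __name__ == "__main__":
    for dt in (0.1, 0.01, 0.001):
        print(f"dt={dt:<6} |K_dt - K| = {gain_error(dt):.6f}")
```

With these illustrative parameters the gap between the two gains shrinks as dt decreases, which is the kind of step-size dependence the error bounds in the abstract quantify. Changing A, B, or beta in `gain_error` gives a quick way to probe how the gap depends on the dynamics and the discount rate.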
