Optimal-PhiBE: A PDE-based Model-free framework for Continuous-time Reinforcement Learning

Yuhua Zhu, Yuming Zhang, Haoyu Zhang

TL;DR

This work addresses continuous-time reinforcement learning (CTRL) with unknown SDE dynamics using only discrete-time data by introducing Optimal-PhiBE, a PDE-based Bellman equation that embeds discrete-time information into the continuous-time HJB framework. It provides i-th order PhiBE formulations and sharp error bounds for LQR, and shows that in the undiscounted case Optimal-PhiBE exactly recovers the optimal policy. Compared with Optimal-BE, it is less sensitive to oscillatory rewards and has smaller discretization error when the uncontrolled dynamics are slow. A model-free policy-iteration algorithm is proposed to solve Optimal-PhiBE directly from trajectory data via Galerkin or gradient-based methods, without estimating the dynamics explicitly. The approach is validated on LQR and Merton’s portfolio problems, where it shows reduced discretization error and better performance than Optimal-BE in many regimes, including weak discounting and control-dominant dynamics.

Abstract

This paper addresses continuous-time reinforcement learning (CTRL) where the system dynamics are governed by an unknown stochastic differential equation and only discrete-time observations are available. Existing approaches face limitations: model-based PDE methods suffer from non-identifiability, while model-free methods based on the discrete-time optimal Bellman equation (Optimal-BE) suffer from large discretization errors that are highly sensitive to both the system dynamics and the reward structure. To overcome these challenges, we introduce Optimal-PhiBE, a formulation that integrates discrete-time information into a continuous-time PDE, combining the strengths of both existing frameworks while mitigating their limitations. Optimal-PhiBE exhibits smaller discretization errors when the uncontrolled system evolves slowly, is less sensitive to oscillatory reward structures, and enables model-free algorithms that bypass explicit dynamics estimation. In the linear-quadratic regulator (LQR) setting, sharp error bounds are established for both Optimal-PhiBE and Optimal-BE. The results show that Optimal-PhiBE exactly recovers the optimal policy in the undiscounted case and substantially outperforms Optimal-BE when the problem is weakly discounted or control-dominant. Furthermore, we extend Optimal-PhiBE to higher orders, providing increasingly accurate approximations. A model-free policy iteration algorithm is proposed to solve the Optimal-PhiBE directly from trajectory data. Numerical experiments are conducted to verify the theoretical findings.
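The undiscounted LQR claim can be checked numerically in a toy setting. The sketch below is not taken from the paper: it assumes a one-dimensional deterministic system $ds = (As + Ba)\,dt$ with reward $-(Qs^2 + Ra^2)$, actions held constant over each interval $\Delta t$, a first-order Optimal-PhiBE of the form $\beta V = \max_a \{ r + \hat{b}\, V' \}$ with $\hat{b}(s,a) = (s_{\Delta t} - s)/\Delta t$, and an Optimal-BE of the form $V(s) = \max_a \{ r\,\Delta t + e^{-\beta \Delta t} V(s_{\Delta t}) \}$. All function names and parameter values are hypothetical.

```python
import math


def hjb_gain(a_coef, b_coef, q, r, beta):
    """Gain K of the policy a = -K s for the scalar HJB
    beta*V = max_a { -(q s^2 + r a^2) + (a_coef*s + b_coef*a) V' }."""
    # With V(s) = -P s^2, P is the positive root of
    # (b^2 / r) P^2 + (beta - 2 a) P - q = 0.
    c2 = b_coef ** 2 / r
    c1 = beta - 2.0 * a_coef
    p = (-c1 + math.sqrt(c1 ** 2 + 4.0 * c2 * q)) / (2.0 * c2)
    return p * b_coef / r


def phibe_gain(A, B, q, r, beta, dt):
    """Gain from the assumed first-order deterministic Optimal-PhiBE."""
    # The drift is replaced by the finite difference (s_dt - s) / dt,
    # which is again linear in (s, a) for a linear system (requires A != 0).
    a_hat = (math.exp(A * dt) - 1.0) / dt
    b_hat = a_hat * B / A
    return hjb_gain(a_hat, b_hat, q, r, beta)


def optimal_be_gain(A, B, q, r, beta, dt, iters=2000):
    """Gain from the assumed discrete-time Optimal-BE via Riccati iteration."""
    a_d = math.exp(A * dt)          # exact one-step state transition
    b_d = (a_d - 1.0) / A * B       # exact one-step effect of a held action
    gamma = math.exp(-beta * dt)    # per-step discount factor
    p = 0.0
    for _ in range(iters):
        denom = r * dt + gamma * p * b_d ** 2
        p = q * dt + gamma * p * a_d ** 2 - (gamma * p * a_d * b_d) ** 2 / denom
    return gamma * p * a_d * b_d / (r * dt + gamma * p * b_d ** 2)


# Hypothetical parameters for illustration only.
A, B, Q, R, BETA, DT = 1.0, 1.0, 1.0, 1.0, 0.0, 0.1

k_true = hjb_gain(A, B, Q, R, BETA)
k_phibe = phibe_gain(A, B, Q, R, BETA, DT)
k_be = optimal_be_gain(A, B, Q, R, BETA, DT)

print("true gain:          ", k_true)
print("Optimal-PhiBE gain: ", k_phibe)
print("Optimal-BE gain:    ", k_be)
```

Under these assumptions the finite-difference coefficients satisfy $\hat{A}/\hat{B} = A/B$, so with $\beta = 0$ the PhiBE gain coincides with the true gain up to floating-point error, while the Optimal-BE gain carries a discretization error that depends on $\Delta t$.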

Paper Structure

This paper contains 62 sections, 14 theorems, 266 equations, 10 figures, 7 algorithms.

Key Result

Theorem 3.4

Under Assumption main ass, $\beta$ is large enough such that $L_\beta$ is positive and $i{\Delta t}\leq 3$, one has where $\mathcal{L}_{b,\Sigma}, h_i$ are defined in def of true hjb and Assumption main ass/(c). Specially, when $\Sigma \equiv 0$ (deterministic dynamics) with Assumption main ass/(a), (b), and $\left\lVert \nabla_s b \right\rVert_\infty < \beta$, one has where $\mathcal{L}_{b}$ i

Figures (10)

  • Figure 1: Value functions under the optimal policies derived from the MDP framework and our framework. Given the identical discrete-time transition dynamics, our framework recovers the optimal policy, while the MDP framework yields a significantly worse one.
  • Figure 2: Error decomposition for continuous-time RL
  • Figure 3: Unidentifiability issue for model-based optimal control given discrete-time information. The left plot shows the trajectories $s_t, \hat{s}_t$ driven by the true $(A,B)$ and the estimated $(\hat{A},\hat{B})$. The right plot compares the optimal policy obtained from the estimated dynamics with the true optimal policy, both measured in terms of the value function under the true dynamics.
  • Figure 4: Comparison of the optimal policy error from Optimal-PhiBE and Optimal-BE. The left plot shows that Optimal-PhiBE exactly recovers the optimal policy when $\beta = 0$. The second plot illustrates how the reward and dynamics influence the error of Optimal-BE. The third plot shows how the discount coefficient $\beta$ affects the errors. The right plot demonstrates that when $\beta > 0$, both Optimal-PhiBE and Optimal-BE achieve first-order approximation with respect to $\Delta t$, while the second-order PhiBE achieves second-order approximation.
  • Figure 5: Comparison in the one-dimensional deterministic case. (A) Case 1, where $\Delta t$ is large. (B) Case 2, where $|A/B|$ is large. (C) Case 3, where $|Q/R|$ is large. (D) Case 4, where $|A|$ is large.
  • ...and 5 more figures

Theorems & Definitions (37)

  • Remark 2.1
  • Definition 3.1: PhiBE [zhu2024phibe]
  • Definition 3.2: Optimal-PhiBE
  • Remark 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Proposition 4.1
  • Remark 4.2
  • Theorem 4.3
  • proof
  • ...and 27 more