A Temporal Difference Method for Stochastic Continuous Dynamics
Haruki Settai, Naoya Takeishi, Takehisa Yairi
TL;DR
This work develops a model-free differential temporal difference (dTD) method for stochastic continuous dynamics by deriving a TD update from the Hamilton-Jacobi-Bellman equation through Itô expansion. The approach preserves the continuity information of the system even when learning from samples, enabling policy evaluation without explicit knowledge of the dynamics coefficients $\mu$ and $\sigma$. The authors prove exponential convergence of the idealized continuous-time dynamics to the fixed point and validate the method on continuous-control tasks, introducing a stabilizing $\beta$-dTD variant that improves learning speed and robustness under process noise. The results bridge stochastic control and model-free reinforcement learning, offering a scalable framework for leveraging continuity in continuous-time RL. The work also provides practical guidance for adapting the method to discrete-time environments and discusses limitations and avenues for future work in stability and variance reduction.
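The following is a minimal sketch of why an Itô (second-order) expansion can remove the need for $\mu$ and $\sigma$. It is not the authors' dTD update. It only checks numerically that the drift and diffusion terms of the HJB equation can be estimated from sampled state increments alone. The Ornstein-Uhlenbeck process, the test function $V(x) = x^2$, and all function and parameter names are illustrative assumptions that do not come from the paper.

```python
import math
import random


def generator_exact(x, theta, sigma):
    """Closed form of mu(x) V'(x) + 0.5 sigma^2 V''(x) for V(x) = x**2.

    Assumed dynamics (hypothetical example): dX = -theta * X dt + sigma dW.
    This version needs explicit access to the coefficients mu and sigma.
    """
    mu = -theta * x
    return mu * 2.0 * x + 0.5 * sigma**2 * 2.0


def generator_from_samples(x, theta, sigma, dt, n_samples, seed=0):
    """Estimate the same quantity from sampled increments dX only.

    theta and sigma are used solely by the simulator that produces the data.
    The estimator itself sees nothing but dX and the derivatives of V.
    """
    rng = random.Random(seed)
    dV = 2.0 * x   # V'(x) for V(x) = x**2
    d2V = 2.0      # V''(x) for V(x) = x**2
    total = 0.0
    for _ in range(n_samples):
        # One Euler-Maruyama step of the assumed SDE, starting from x.
        dx = -theta * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Second-order expansion of V(x + dx) - V(x), divided by dt.
        total += (dV * dx + 0.5 * d2V * dx * dx) / dt
    return total / n_samples


exact = generator_exact(1.0, 1.0, 0.5)
estimate = generator_from_samples(1.0, 1.0, 0.5, dt=0.01, n_samples=400_000)
print(exact, estimate)
```

The sample average carries a bias of order `dt` and Monte Carlo noise whose per-sample standard deviation grows like $1/\sqrt{\Delta t}$, so a small time step calls for many samples in this naive estimator.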
Abstract
For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman's Principle of Optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, existing methods typically assume that the underlying dynamics are known a priori, because updating the value function according to the HJB equation requires explicit access to the coefficient functions of the dynamical equations. We address this inherent limitation of HJB-based RL by proposing a model-free approach that still targets the HJB equation, together with a corresponding temporal difference method. We establish exponential convergence of the idealized continuous-time dynamics and empirically demonstrate the method's potential advantages over transition-kernel-based formulations. The proposed formulation paves the way toward bridging stochastic control and model-free reinforcement learning.
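For reference, the equation in question can be written out in a standard textbook form. The sketch below assumes a fixed policy, a running reward $r$, and a discount rate $\rho > 0$; this notation is an assumption and may differ from the paper's. With state dynamics

$$
dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t,
\qquad
V(x) = \mathbb{E}\left[\int_0^\infty e^{-\rho t}\, r(X_t)\,dt \;\middle|\; X_0 = x\right],
$$

the policy-evaluation form of the HJB equation reads

$$
\rho\, V(x) = r(x) + \mu(x)^\top \nabla V(x) + \tfrac{1}{2}\operatorname{tr}\!\left(\sigma(x)\sigma(x)^\top \nabla^2 V(x)\right).
$$

Both $\mu$ and $\sigma$ appear explicitly on the right-hand side, which is why methods that update the value function directly from this equation need the coefficient functions.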
