A Temporal Difference Method for Stochastic Continuous Dynamics

Haruki Settai, Naoya Takeishi, Takehisa Yairi

TL;DR

This work develops a model-free differential temporal difference (dTD) method for stochastic continuous dynamics by deriving a TD update from the Hamilton-Jacobi-Bellman equation through Itô expansion. The approach preserves the continuity information of the system even when learning from samples, enabling policy evaluation without explicit knowledge of the dynamics coefficients $\mu$ and $\sigma$. The authors prove exponential convergence of the idealized continuous-time dynamics to the fixed point and validate the method on continuous-control tasks, introducing a stabilizing $\beta$-dTD variant that improves learning speed and robustness under process noise. The results bridge stochastic control and model-free reinforcement learning, offering a scalable framework for leveraging continuity in continuous-time RL. The work also provides practical guidance for adapting the method to discrete-time environments and discusses limitations and avenues for future work in stability and variance reduction.

Abstract

For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman's Principle of Optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, existing methods typically assume that the underlying dynamics are known a priori, because they need explicit access to the coefficient functions of the dynamical equations to update the value function following the HJB equation. We address this inherent limitation of HJB-based RL: we propose a model-free approach that still targets the HJB equation, together with the corresponding temporal difference method. We establish exponential convergence of the idealized continuous-time dynamics and empirically demonstrate the method's potential advantages over transition-kernel-based formulations. The proposed formulation paves the way toward bridging stochastic control and model-free reinforcement learning.
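
For orientation, the textbook objects that the abstract refers to can be written out as follows. This is a sketch under assumptions that are not stated in this excerpt: a discounted infinite-horizon objective with discount rate $\rho$, a reward rate $r$, and a fixed policy absorbed into the coefficient functions $\mu$ and $\sigma$. The paper's own notation and conventions may differ.

```latex
% Assumed setup (not given in this excerpt): discount rate \rho > 0, reward rate r(s),
% and a fixed policy absorbed into \mu and \sigma.
dS_t = \mu(S_t)\,dt + \sigma(S_t)\,dW_t, \qquad
V(s) = \mathbb{E}\!\left[\int_0^\infty e^{-\rho t}\, r(S_t)\,dt \,\middle|\, S_0 = s\right]

% Ito's formula applied to V(S_t):
dV(S_t) = \Big(\mu^\top \nabla V + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma\sigma^\top \nabla^2 V\big)\Big)(S_t)\,dt
          + \nabla V(S_t)^\top \sigma(S_t)\,dW_t

% Resulting HJB equation for policy evaluation:
\rho\, V(s) = r(s) + \mu(s)^\top \nabla V(s)
            + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma(s)\sigma(s)^\top \nabla^2 V(s)\big)
```

Both $\mu$ and $\sigma$ appear explicitly on the right-hand side of the last equation, which is why methods that update the value function directly from it need access to the coefficient functions.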

Paper Structure

This paper contains 40 sections, 3 theorems, 49 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

When a stochastic process $(S_t)_{t \geq 0}$ follows the SDE in eq:sde, we have and

Figures (2)

  • Figure 1: Qualitative difference between the typical TD method and the proposed dTD method; the objects in red indicate what is adjusted by each temporal difference. (Left) In the typical TD method, the values of $\hat{V}$ are adjusted to minimize the TD error. (Right) In the dTD method, the gradient and the second derivative of $\hat{V}$ at $s_t$ are adjusted to minimize the dTD error.
  • Figure 2: Performance of TD, $\beta$-naive-dTD, and $\beta$-dTD on continuous-control benchmarks. Each column corresponds to a different noise level ($\text{coef} = 0.00, 0.01, 0.05$), and each row corresponds to a different environment. The tuned $\beta$ values for $\beta$-naive-dTD were 0.08, 0.07, 0.23, 0.02 and for $\beta$-dTD were 0.57, 0.74, 0.24, 0.33 in Hopper, HalfCheetah, Ant, and Humanoid, respectively.
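
The idea in the Figure 1 caption, that the gradient and second derivative of $\hat{V}$ at $s_t$ are what the dTD error constrains, can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from the paper: the one-dimensional Ornstein-Uhlenbeck dynamics, the reward $r(s) = -s^2$, the discount rate `rho`, the hypothetical helper names `dtd_error` and `mean_residual`, and the specific error form, which is a second-order Taylor expansion of $\hat{V}$ evaluated on a sampled increment $\Delta s$ instead of on $\mu$ and $\sigma$. The paper's Definition 1 and its $\beta$-stabilized variant may differ from this form.

```python
import numpy as np


def dtd_error(v, dv, d2v, reward, ds, dt, rho):
    """Sample-based HJB residual over one step of length dt (1-D state).

    Uses only the observed increment ds; the drift and diffusion
    coefficients are never passed in. Illustrative form, assumed here.
    """
    return reward * dt + dv * ds + 0.5 * d2v * ds**2 - rho * v * dt


def mean_residual(a, c, s=1.0, theta=1.0, sigma=0.5, rho=0.5,
                  dt=0.01, n=400_000, seed=0):
    """Average dTD error per unit time for the candidate V(s) = a*s^2 + c."""
    rng = np.random.default_rng(seed)
    # Euler-Maruyama increments of dS = -theta*S dt + sigma dW, all started at s.
    ds = -theta * s * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    v, dv, d2v = a * s**2 + c, 2.0 * a * s, 2.0 * a
    delta = dtd_error(v, dv, d2v, -s**2, ds, dt, rho)
    return float(np.mean(delta) / dt)


theta, sigma, rho = 1.0, 0.5, 0.5
# Closed-form value function of this toy problem: V(s) = a*s^2 + c.
a_true = -1.0 / (rho + 2.0 * theta)
c_true = sigma**2 * a_true / rho

print("true V :", mean_residual(a_true, c_true))  # close to zero
print("V == 0 :", mean_residual(0.0, 0.0))        # clearly nonzero
```

For the closed-form value function the averaged error is close to zero up to sampling noise and an $O(\Delta t)$ discretization bias, while a wrong candidate such as $\hat{V} \equiv 0$ leaves a clear residual. Individual samples of this error are noisy, since the term $\hat{V}'(s)\,\Delta s$ scales like $\sqrt{\Delta t}$.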

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • Remark 1
  • Definition 1: differential temporal difference
  • Definition 2: HJB operator under a fixed policy
  • Definition 3: Infinitesimal Generator
  • Lemma 1: Existence and Uniqueness of the Fixed Point
  • Proposition 2: Exponential Stability of the Dynamics
  • proof
  • proof