Deterministic Policy Gradient for Reinforcement Learning with Continuous Time and State

Ziheng Cheng, Xin Guo, Yufei Zhang

Abstract

The theory of continuous-time reinforcement learning (RL) has progressed rapidly in recent years. While the ultimate objective of RL is typically to learn deterministic control policies, most existing continuous-time RL methods rely on stochastic policies. Such approaches often require sampling actions at very high frequencies, and involve computationally expensive expectations over continuous action spaces, resulting in high-variance gradient estimates and slow convergence. In this paper, we introduce and develop deterministic policy gradient (DPG) methods for continuous-time RL. We derive a continuous-time policy gradient formula expressed as the expected gradient of an advantage rate function and establish a martingale characterization for both the value function and the advantage rate. These theoretical results provide tractable estimators for deterministic policy gradients in continuous-time RL. Building on this foundation, we propose a model-free continuous-time Deep Deterministic Policy Gradient (CT-DDPG) algorithm that enables stable learning for general reinforcement learning problems with continuous time-and-state. Numerical experiments show that CT-DDPG achieves superior stability and faster convergence compared to existing stochastic-policy methods, across a wide range of learning tasks with varying time discretizations and noise levels.

Paper Structure

This paper contains 39 sections, 11 theorems, 89 equations, 5 figures, 1 algorithm.

Key Result

Proposition 3.1

Suppose Assumptions assum:wp and assum:differentiability hold. For all $(t,x)\in [0,T]\times {\mathbb{R}}^n$ and $\phi \in {\mathbb{R}}^k$, […] where $A^\phi(t,x,a)\coloneqq \mathcal{L}[V^\phi](t,x,a) + r(t,x,a)$.
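
To make the advantage rate $A^\phi = \mathcal{L}[V^\phi] + r$ concrete, the following is a minimal numerical sketch for a scalar state. It assumes the standard generator of a controlled diffusion $dX_t = b\,dt + \sigma\,dW_t$ without discounting, $\mathcal{L}[V](t,x,a) = \partial_t V + b(t,x,a)\,\partial_x V + \tfrac12 \sigma(t,x,a)^2\,\partial_{xx} V$; the excerpt above does not define $\mathcal{L}$, so this form is an assumption. The functions `V`, `b`, `sigma`, `r` and the helpers `advantage_rate` and `advantage_rate_grad_a` are hypothetical illustrations, not the paper's implementation.

```python
def advantage_rate(V, b, sigma, r, t, x, a, h=1e-3):
    """Approximate A(t, x, a) = L[V](t, x, a) + r(t, x, a) for a scalar state.

    Assumes L[V] = dV/dt + b * dV/dx + 0.5 * sigma^2 * d2V/dx2
    (standard controlled-diffusion generator, no discounting).
    Derivatives of V are taken by central finite differences with step h.
    """
    dV_dt = (V(t + h, x) - V(t - h, x)) / (2 * h)
    dV_dx = (V(t, x + h) - V(t, x - h)) / (2 * h)
    d2V_dx2 = (V(t, x + h) - 2 * V(t, x) + V(t, x - h)) / h**2
    generator = dV_dt + b(t, x, a) * dV_dx + 0.5 * sigma(t, x, a) ** 2 * d2V_dx2
    return generator + r(t, x, a)


def advantage_rate_grad_a(V, b, sigma, r, t, x, a, h=1e-3):
    """Central-difference approximation of dA/da at (t, x, a)."""
    up = advantage_rate(V, b, sigma, r, t, x, a + h, h)
    down = advantage_rate(V, b, sigma, r, t, x, a - h, h)
    return (up - down) / (2 * h)


# Hypothetical toy problem (not from the paper): quadratic value guess,
# drift equal to the action, constant volatility, quadratic running cost.
def V(t, x):
    return (1.0 - t) * x**2


def b(t, x, a):
    return a


def sigma(t, x, a):
    return 0.5


def r(t, x, a):
    return -(x**2 + a**2)


A_val = advantage_rate(V, b, sigma, r, t=0.5, x=1.0, a=0.3)
dA_da = advantage_rate_grad_a(V, b, sigma, r, t=0.5, x=1.0, a=0.3)
print("advantage rate:", A_val)
print("gradient in action:", dA_da)
```

The action-gradient is included because the abstract describes the policy gradient as an expected gradient of the advantage rate; in a learned setting one would typically replace the finite differences with automatic differentiation through a parameterized $V_\theta$.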

Figures (5)

  • Figure 1: Model-aware LQ with linear policy.
  • Figure 2: Model-agnostic LQ with neural network parameterized policy.
  • Figure 3: Comparison of CT-DDPG with discrete-time RL algorithms.
  • Figure 4: Comparison between continuous-time RL algorithms.
  • Figure 5: Noise-to-signal ratio of the stochastic gradient when training the value network $V_\theta$.

Theorems & Definitions (31)

  • Proposition 3.1
  • Remark 1
  • Remark 2
  • Theorem 3.2
  • Remark 3: Local exploration
  • Remark 4: Simplified Bellman equation
  • Remark 5
  • Remark 6
  • Example 1
  • Theorem 3.3
  • ...and 21 more