Gaining efficiency in deep policy gradient method for continuous-time optimal control problems

Arash Fahim, Md. Arafatur Rahman

TL;DR

This work tackles the high computational cost of applying policy gradient methods to continuous-time stochastic optimal control by introducing a multi-scale, deep PGM that begins with a coarse time discretization and progressively refines to finer scales. A dynamic-programming–inspired framework guides policy generalization across intervals, while separate neural networks at each scale manage resources and data, enabling efficient learning. A theoretical result characterizes how resource allocation across scales yields targeted efficiency gains, and numerical experiments on linear-quadratic stochastic control demonstrate dramatic speedups with preserved accuracy compared to brute-force, single-scale implementations. The approach offers a practical avenue for scalable, high-frequency optimal control and continuous-time RL problems where fine time discretization is required.

Abstract

In this paper, we propose an efficient implementation of the deep policy gradient method (PGM) for optimal control problems in continuous time. The proposed method can manage the allocation of computational resources, the number of trajectories, and the complexity of the neural network architecture. This is particularly important for continuous-time problems that require a fine time discretization. Each step of the method focuses on a different time scale and learns a policy, modeled by a neural network, for a discretized optimal control problem. The first step has the coarsest time discretization, and the discretization becomes finer with each subsequent step. The trained policy from each step is also used to provide data for the next step. We accompany the multi-scale deep PGM with a theoretical result on the allocation of computational resources to obtain a targeted efficiency, and we test our methods on the linear-quadratic stochastic optimal control problem.
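The coarse-to-fine structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dynamics `dX = u dt + sigma dW`, the placeholder feedback `coarse_policy` (standing in for a trained coarse-scale network), and the values of `T`, `N1`, `N2`, and `n_paths` are all assumptions chosen for illustration. The sketch only shows how states simulated under a coarse policy could serve as starting data for finer grids nested inside each coarse interval; no policy is trained here.

```python
import numpy as np

def simulate(policy, x0, t_grid, rng, sigma=0.5):
    """Euler-Maruyama paths of dX = u dt + sigma dW on t_grid (assumed dynamics)."""
    x = np.empty((len(t_grid), len(x0)))
    x[0] = x0
    for n in range(len(t_grid) - 1):
        h = t_grid[n + 1] - t_grid[n]
        u = policy(t_grid[n], x[n])                      # control at the current node
        dw = rng.normal(0.0, np.sqrt(h), size=len(x0))   # Brownian increments
        x[n + 1] = x[n] + u * h + sigma * dw
    return x

def coarse_policy(t, x):
    # Hypothetical placeholder for a policy trained on the coarse grid.
    return -x

# Illustrative sizes (assumptions): N1 coarse intervals, N2 fine steps in each.
T, N1, N2, n_paths = 1.0, 10, 10, 1000
rng = np.random.default_rng(0)

# Step 1: coarse grid and trajectories generated under the coarse policy.
coarse_grid = np.linspace(0.0, T, N1 + 1)
coarse_paths = simulate(coarse_policy, np.ones(n_paths), coarse_grid, rng)

# Step 2: each coarse interval gets its own finer grid; the coarse states at
# the left endpoint of interval i are the starting data for that sub-problem.
fine_grids = [np.linspace(coarse_grid[i], coarse_grid[i + 1], N2 + 1)
              for i in range(N1)]
start_states = [coarse_paths[i] for i in range(N1)]

fine_step = fine_grids[0][1] - fine_grids[0][0]  # equals T / (N1 * N2)
```

In this layout each fine-scale sub-problem spans only `N2` steps, whereas a single-scale approach on the same resolution would simulate all `N1 * N2` steps in every trajectory.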

Paper Structure

This paper contains 17 sections, 3 theorems, 55 equations, 4 figures.

Key Result

Theorem 2.1

Under Assumptions A.1 and A.2 of the error-analysis section, there exists a constant $C>0$, independent of $N$, such that … where $V(t,x)$ is the value function of the continuous-time control problem, given by … where $\Pi_t$ is the set of admissible controls restricted to $[t,T]$. The constant $C$ depends only on $T$ and the Lipschitz constants of $\mu$, $\sigma$, $L$, and $g$.
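For orientation, a value function of this kind usually takes the following standard form. This is a sketch under assumptions, not the paper's displayed equation: it assumes cost minimization, controlled dynamics driven by a Brownian motion $W$, and the argument lists shown for $\mu$, $\sigma$, and $L$.

```latex
\begin{aligned}
  dX_s &= \mu(s, X_s, \pi_s)\,ds + \sigma(s, X_s, \pi_s)\,dW_s,
          \qquad X_t = x, \\
  V(t,x) &= \inf_{\pi \in \Pi_t}
          \mathbb{E}\!\left[ \int_t^T L(s, X_s, \pi_s)\,ds + g(X_T)
          \,\middle|\, X_t = x \right].
\end{aligned}
```

Here $L$ plays the role of the running cost and $g$ the terminal cost, which matches the four functions whose Lipschitz constants the theorem refers to.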

Figures (4)

  • Figure 1: The choice of parameters for the LQSC problem, as well as for the first (coarse) and second (s2) steps of the multi-scale method and for the brute-force (bf) method.
  • Figure 2: Comparison of the cost function between $2$-fold PGM and brute-force PGM with the closed-form solutions for $100$ time steps over 10 independent runs.
  • Figure 3: The choice of parameters for the LQSC problem, as well as for the first (coarse), second (s2), and third (s3) steps of the multi-scale method and for the brute-force (bf) method.
  • Figure 4: Comparison of the cost function between $3$-fold PGM and brute-force PGM with the closed-form solutions for $125$ time steps over 10 independent runs.

Theorems & Definitions (8)

  • Theorem 2.1
  • Proposition 2.1
  • Remark 2.1: Brute-force PGM
  • Theorem 3.1
  • proof
  • Example 3.1
  • Example 3.2
  • Remark 4.1: Choice of parameters