Physics-informed approach for exploratory Hamilton--Jacobi--Bellman equations via policy iterations

Yeongjong Kim, Namkyeong Cho, Minseok Kim, Yeoneung Kim

TL;DR

The paper tackles entropy-regularized stochastic control by solving the exploratory HJB equation with a fully mesh-free PINN-based policy iteration (PINN-SPI). It develops a rigorous $L^2$-energy framework and a three-term error decomposition that separates iteration, policy-network, and PDE-residual contributions, proving that total error remains bounded and that exact PI converges exponentially. The method demonstrates scalability to high-dimensional problems (e.g., 5D and 10D LQR) and robustness on nonlinear stochastic benchmarks (pendulum, cartpole), with monotonic improvement in the value function across iterations. The work offers a principled, mesh-free solver that integrates classical policy-iteration theory with modern operator-learning techniques, opening avenues for reliable high-dimensional stochastic control under entropy regularization.

Abstract

We propose a mesh-free policy iteration framework based on physics-informed neural networks (PINNs) for solving entropy-regularized stochastic control problems. The method iteratively alternates between soft policy evaluation and improvement using automatic differentiation and neural approximation, without relying on spatial discretization. We present a detailed $L^2$ error analysis that decomposes the total approximation error into three sources: iteration error, policy network error, and PDE residual error. The proposed algorithm is validated with a range of challenging control tasks, including high-dimensional linear-quadratic regulation in 5D and 10D, as well as nonlinear systems such as pendulum and cartpole problems. Numerical results confirm the scalability, accuracy, and robustness of our approach across both linear and nonlinear benchmarks.
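The soft policy improvement step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a finite grid of candidate actions (the paper itself is mesh-free and uses neural approximation), a reward-maximizing sign convention, and hypothetical names (`soft_policy_improvement`, `soft_value`, `q_values`, `temperature`). Under those assumptions, the entropy-regularized maximizer is the standard Gibbs (softmax) distribution, and the regularized optimum is a scaled log-sum-exp.

```python
import numpy as np


def soft_policy_improvement(q_values, temperature):
    """Gibbs (softmax) policy over a finite action grid.

    q_values: array of shape (n_states, n_actions); hypothetical per-action
              scores to be maximized at each state (an assumption of this sketch).
    temperature: entropy-regularization weight, must be positive.
    """
    q = np.asarray(q_values, dtype=float)
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    logits = (q - q.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def soft_value(q_values, temperature):
    """Entropy-regularized maximum: temperature * log-sum-exp(q / temperature)."""
    q = np.asarray(q_values, dtype=float)
    q_max = q.max(axis=1, keepdims=True)
    lse = np.log(np.exp((q - q_max) / temperature).sum(axis=1))
    return q_max[:, 0] + temperature * lse


# Toy scores for two states and three candidate actions (illustrative data only).
q = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
pi = soft_policy_improvement(q, temperature=0.5)
v = soft_value(q, temperature=0.5)
```

In this toy setting each row of `pi` is a probability distribution that favors higher-scoring actions, and a state with equal scores receives the uniform policy. Lowering `temperature` concentrates the policy on the greedy action, while raising it increases exploration.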

Paper Structure

This paper contains 40 sections, 6 theorems, 71 equations, 5 figures, 2 algorithms.

Key Result

Lemma 1

Given $\tilde{r}, \tilde{b}, \tilde{\sigma} \in C_b^{2}(\mathbb{R}^d)$, assume that (i) $\rho > \frac{1}{2}B$, where $B := \|\nabla_x \cdot \tilde{b}(\cdot,u)\|_{L^\infty(\mathbb{R}^d)}$, (ii) $\tilde{\Sigma}(x)\succeq \frac{1}{C_0}I_d$, (iii) $\tilde{r} \in L^2(\mathbb{R}^d)$. Let $v$ be a unique c Then, $v \in C^2_b(\mathbb{R}^d)$ and we have the $L^2$ energy bound: Therefore,

Figures (5)

  • Figure 1: Comparison of PINN-SPI, SAC, and PPO on 5D and 10D stochastic LQR problems with compact action constraints.
  • Figure 2: Evaluation reward over training time for PINN-SPI on LQR tasks. The average total reward increases monotonically as policy iteration proceeds.
  • Figure 3: Comparison between PINN-SPI, SAC, and PPO on the cartpole and pendulum problems.
  • Figure 4: Comparison between our method and the baselines SAC and PPO.
  • Figure 5: Evaluation reward over training time for PINN-SPI on cartpole and pendulum tasks. The average total reward tends to increase as policy iteration proceeds.

Theorems & Definitions (8)

  • Lemma 1: Energy estimate for linear elliptic PDE with drift and diffusion
  • Proposition 1
  • Lemma 2
  • proof : Sketch of Proof
  • Theorem 1: Convergence of policy iteration
  • Theorem 2: $L^2$ error
  • Lemma 3: Local policy stability on $B_R$
  • proof : Proof of Theorem \ref{thm:expanded}