Policy Iteration for Exploratory Hamilton--Jacobi--Bellman Equations

Hung Vinh Tran, Zhenhua Wang, Yuming Paul Zhang

TL;DR

The paper develops a rigorous convergence theory for the policy iteration algorithm (PIA) applied to entropy-regularized exploratory HJB equations on an infinite horizon with a large discount rate. Under bounded coefficients and a smallness condition on the control dependence of the diffusion, it establishes uniform $\mathcal{C}^{2,\alpha}$ regularity for the PIA iterates and proves convergence with quantitative rates. For unbounded coefficients, it proves well-posedness of the exploratory HJB equation and convergence of PIA using locally uniform $\mathcal{C}^{1,\alpha}$ estimates. In the unbounded setting with diffusion independent of the control, interior $W^{2,p}$ estimates yield monotone convergence of the PIA iterates to the HJB solution, with the policy iterates converging to the optimal relaxed policy, all verified via viscosity-solution stability. Together, the results broaden the applicability of PIA to nonlinear, including fully nonlinear, HJB equations arising in entropy-regularized stochastic control and reinforcement learning, covering regimes with unbounded data and diffusion terms.

Abstract

We study the policy iteration algorithm (PIA) for entropy-regularized stochastic control problems on an infinite time horizon with a large discount rate, focusing on two main scenarios. First, we analyze PIA with bounded coefficients where the controls applied to the diffusion term satisfy a smallness condition. We demonstrate the convergence of PIA based on a uniform $\mathcal{C}^{2,\alpha}$ estimate for the value sequence generated by PIA, and provide a quantitative convergence analysis for this scenario. Second, we investigate PIA with unbounded coefficients but no control over the diffusion term. In this scenario, we first establish the well-posedness of the exploratory Hamilton--Jacobi--Bellman equation with linear growth coefficients and a polynomial growth reward function. Using this well-posedness result, we obtain the convergence of PIA by establishing quantitative locally uniform $\mathcal{C}^{1,\alpha}$ estimates for the generated value sequence.
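The alternation that PIA performs (evaluate the current policy, then improve it) can be illustrated on a small finite-state problem. The sketch below is only a discrete analogue, not the PDE scheme analyzed in the paper: it assumes a finite Markov decision process, a discount factor `gamma` standing in for the large discount rate, an entropy weight `lam`, and a Gibbs (softmax) policy update. The function name `soft_policy_iteration` and the toy data `TOY_P` and `TOY_R` are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical toy data: 3 states, 2 actions.
# TOY_P[s][a][t] is the probability of moving from state s to state t under action a.
TOY_P = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]],
    [[0.0, 0.3, 0.7], [0.6, 0.0, 0.4]],
]
# TOY_R[s][a] is the running reward for taking action a in state s.
TOY_R = [[1.0, 0.0], [0.0, 0.5], [0.2, 1.0]]


def soft_policy_iteration(P, r, gamma=0.5, lam=0.5, n_iter=30, eval_sweeps=200):
    """Entropy-regularized policy iteration on a finite MDP (illustrative analogue).

    Returns the list of value vectors (one per iteration) and the final policy.
    """
    n_s, n_a = len(r), len(r[0])

    def q_values(v, s):
        # q(s, a) = r(s, a) + gamma * E[v(next state)]
        return [r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_s))
                for a in range(n_a)]

    def entropy_bonus(p):
        # -lam * log p, with the convention 0 * log 0 = 0 handled by the caller's weight p
        return -lam * math.log(p) if p > 0.0 else 0.0

    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]  # start from the uniform policy
    history = []
    for _ in range(n_iter):
        # Policy evaluation: fixed-point sweeps on the linear equation for the
        # value of the current policy (a contraction because gamma < 1).
        v = [0.0] * n_s
        for _ in range(eval_sweeps):
            v = [sum(pi[s][a] * (q + entropy_bonus(pi[s][a]))
                     for a, q in enumerate(q_values(v, s)))
                 for s in range(n_s)]
        history.append(v)

        # Policy improvement: Gibbs update, pi(a|s) proportional to exp(q(s, a) / lam).
        for s in range(n_s):
            q = q_values(v, s)
            q_max = max(q)  # subtract the maximum for numerical stability
            weights = [math.exp((x - q_max) / lam) for x in q]
            total = sum(weights)
            pi[s] = [w / total for w in weights]
    return history, pi


if __name__ == "__main__":
    values, policy = soft_policy_iteration(TOY_P, TOY_R)
    print("final values:", values[-1])
    print("final policy:", policy)
```

In this finite setting the value vectors increase monotonically from one iteration to the next and approach the fixed point of the entropy-regularized Bellman equation, which loosely mirrors the monotone convergence described above. The regularity estimates that drive the continuous-state analysis have no counterpart in this toy example.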

Paper Structure

This paper contains 10 sections, 13 theorems, 160 equations, and 1 algorithm.

Key Result

Lemma 2.1

Let $\rho\geq 1$, and let $v$ be a solution to ... Assume that $\tilde{\Sigma}\geq \mathbb{I}_d/C_0$ for some $C_0>0$. Then there exists an increasing function $\eta: [1,\infty)\to [1,\infty)$ independent of $\rho$ such that if ... we have ...

Theorems & Definitions (28)

  • Lemma 2.1
  • proof
  • Remark 2.1
  • Lemma 2.2
  • proof
  • Theorem 2.1
  • proof
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • ...and 18 more