Policy Iteration for Exploratory Hamilton--Jacobi--Bellman Equations

Hung Vinh Tran; Zhenhua Wang; Yuming Paul Zhang

TL;DR

The paper develops a rigorous convergence theory for the policy iteration algorithm (PIA) applied to entropy-regularized exploratory Hamilton--Jacobi--Bellman (HJB) equations on an infinite horizon with a large discount rate. Under bounded coefficients and a smallness condition on the control of the diffusion term, it establishes uniform $C^{2,\alpha}$ regularity for the PIA iterates and proves convergence with quantitative rates. For unbounded coefficients, it proves well-posedness of the exploratory HJB equation and obtains convergence of PIA from locally uniform $C^{1,\alpha}$ estimates. In the unbounded setting with diffusion independent of the control, interior $W^{2,p}$ estimates yield monotone convergence of the PIA iterates to the HJB solution, with the policy iterates converging to the optimal relaxed policy, all verified via viscosity-solution stability. Together, the results broaden the applicability of PIA to fully nonlinear HJB equations arising in entropy-regularized stochastic control and reinforcement learning, including regimes with unbounded data and diffusion terms.
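For readers who prefer the iteration in symbols, the display below is a sketch in notation assumed here rather than quoted from the paper: $\rho$ is the discount rate, $\lambda>0$ the entropy weight, $U$ the control set, $\mathcal{P}(U)$ the probability densities on $U$, and $r$, $b$, $\sigma$ the reward, drift, and diffusion coefficients. It follows the form in which the exploratory HJB equation usually appears in the entropy-regularized control literature; the paper's precise assumptions and normalizations may differ.

$$
\rho v(x) = \sup_{\pi\in\mathcal{P}(U)} \int_U \Big[ r(x,u) + b(x,u)\cdot\nabla v(x) + \tfrac12\operatorname{tr}\big(\sigma\sigma^{\top}(x,u)\,D^2 v(x)\big) - \lambda\ln\pi(u) \Big]\,\pi(u)\,du .
$$

Given a policy $\pi^n$, one PIA step first solves the linear equation for $v^n$ (policy evaluation) and then updates the policy by a Gibbs formula (policy improvement):

$$
\rho v^n(x) = \int_U \Big[ r(x,u) + b(x,u)\cdot\nabla v^n(x) + \tfrac12\operatorname{tr}\big(\sigma\sigma^{\top}(x,u)\,D^2 v^n(x)\big) - \lambda\ln\pi^n(x,u) \Big]\,\pi^n(x,u)\,du ,
$$

$$
\pi^{n+1}(x,u) = \frac{\exp\!\Big(\tfrac1\lambda\big[ r(x,u) + b(x,u)\cdot\nabla v^n(x) + \tfrac12\operatorname{tr}\big(\sigma\sigma^{\top}(x,u)\,D^2 v^n(x)\big)\big]\Big)}{\int_U \exp\!\Big(\tfrac1\lambda\big[ r(x,u') + b(x,u')\cdot\nabla v^n(x) + \tfrac12\operatorname{tr}\big(\sigma\sigma^{\top}(x,u')\,D^2 v^n(x)\big)\big]\Big)\,du'} .
$$

When $\sigma$ does not depend on $u$, the second-order term cancels in the Gibbs ratio, so the update involves only $\nabla v^n$; this is consistent with gradient-level ($C^{1,\alpha}$) estimates being the relevant ones in that case.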

Abstract

We study the policy iteration algorithm (PIA) for entropy-regularized stochastic control problems on an infinite time horizon with a large discount rate, focusing on two main scenarios. First, we analyze PIA with bounded coefficients where the controls applied to the diffusion term satisfy a smallness condition. We demonstrate the convergence of PIA based on a uniform $\mathcal{C}^{2,\alpha}$ estimate for the value sequence generated by PIA, and provide a quantitative convergence analysis for this scenario. Second, we investigate PIA with unbounded coefficients but no control over the diffusion term. In this scenario, we first establish the well-posedness of the exploratory Hamilton--Jacobi--Bellman equation with linear growth coefficients and a polynomial growth reward function. Using this well-posedness result, we obtain PIA's convergence by establishing quantitative locally uniform $\mathcal{C}^{1,\alpha}$ estimates for the generated value sequence.
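The two-step structure of PIA (evaluate the current policy, then update it with a Gibbs formula) can be illustrated on a finite-state, discrete-time analog. The sketch below is not the paper's PDE setting: the function name `soft_policy_iteration`, the transition table `P`, the reward table `r`, and the use of a discount factor `gamma` in place of the continuous discount rate are all assumptions made for illustration. A small `gamma` plays a role loosely comparable to a large discount rate.

```python
import math


def soft_policy_iteration(P, r, gamma, lam, n_iter=30, eval_sweeps=200):
    """Entropy-regularized policy iteration on a finite MDP (illustrative).

    P[s][a][t]: probability of moving from state s to t under action a.
    r[s][a]:    reward; gamma in (0, 1): discount factor; lam > 0: entropy weight.
    Returns (history, pi): the value vector of each evaluated policy,
    and the policy produced by the last Gibbs update.
    """
    n_s, n_a = len(r), len(r[0])
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]  # start from the uniform policy
    history = []

    def q_value(v, s, a):
        # One-step lookahead: reward plus discounted expected next value.
        return r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_s))

    for _ in range(n_iter):
        # Policy evaluation: iterate the linear fixed-point equation
        # v = E_pi[ r - lam * ln(pi) + gamma * P v ], a gamma-contraction.
        v = [0.0] * n_s
        for _ in range(eval_sweeps):
            v = [
                sum(
                    pi[s][a] * (q_value(v, s, a) - lam * math.log(pi[s][a]))
                    for a in range(n_a)
                    if pi[s][a] > 0.0  # convention: 0 * ln(0) = 0
                )
                for s in range(n_s)
            ]
        history.append(v)

        # Policy improvement: Gibbs update pi(a|s) proportional to exp(q / lam),
        # shifted by max(q) for numerical stability.
        new_pi = []
        for s in range(n_s):
            q = [q_value(v, s, a) for a in range(n_a)]
            q_max = max(q)
            w = [math.exp((qa - q_max) / lam) for qa in q]
            z = sum(w)
            new_pi.append([wa / z for wa in w])
        pi = new_pi
    return history, pi


# Hypothetical 3-state, 2-action example (numbers chosen only for illustration).
P = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]],
    [[0.0, 0.3, 0.7], [0.3, 0.0, 0.7]],
]
r = [[0.0, 1.0], [0.5, 0.0], [2.0, -1.0]]
history, pi = soft_policy_iteration(P, r, gamma=0.5, lam=0.5)
```

In this finite setting the evaluated values are nondecreasing from one iteration to the next, and they approach the fixed point of the soft Bellman equation $v(s) = \lambda \ln \sum_a \exp(q(s,a)/\lambda)$. This only mirrors, and does not reproduce, the convergence statements that the paper proves for the continuous problem.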

Paper Structure (10 sections, 13 theorems, 160 equations, 1 algorithm)

Introduction
Model formulation
Organization of the paper
Uniform ${\mathcal{C}}^{2,\alpha}$ estimates for bounded equations
Convergence for PIA with bounded coefficients
Convergence for uniform ${\mathcal{C}}^{2,\alpha}$ solutions
Quantitative convergence results
Unbounded degenerate elliptic equations
Existence and uniqueness
Convergence of PIA with unbounded coefficients

Key Result

Lemma 2.1

Let $\rho\geq 1$, and let $v$ be a solution to Assume that $\tilde{\Sigma}\geq \mathbb{I}_d/C_0$ for some $C_0>0$. Then there exists an increasing function $\eta: [1,\infty)\to [1,\infty)$ independent of $\rho$ such that if we have

Theorems & Definitions (28)

Lemma 2.1
proof
Remark 2.1
Lemma 2.2
proof
Theorem 2.1
proof
Theorem 3.1
proof
Theorem 3.2
...and 18 more
