Policy Iteration for Exploratory Hamilton--Jacobi--Bellman Equations
Hung Vinh Tran, Zhenhua Wang, Yuming Paul Zhang
TL;DR
The paper develops a rigorous convergence theory for the policy iteration algorithm (PIA) applied to entropy-regularized exploratory HJB equations on an infinite horizon with a large discount rate. It establishes uniform $C^{2,\alpha}$ regularity for the PIA iterates under bounded coefficients with a smallness condition on the control of the diffusion term, and proves convergence with quantitative rates. It also treats unbounded coefficients by proving well-posedness of the exploratory HJB equation and then proving convergence of PIA using locally uniform $C^{1,\alpha}$ estimates. In the unbounded-coefficient setting with diffusion independent of the control, interior $W^{2,p}$ estimates yield monotone convergence of the PIA iterates to the HJB solution, with the policy iterates converging to the optimal relaxed policy, all verified via viscosity-solution stability. Together, the results broaden the applicability of PIA to nonlinear (including fully nonlinear) HJB equations arising in entropy-regularized stochastic control and reinforcement learning, including regimes with unbounded data and diffusion terms.
Abstract
We study the policy iteration algorithm (PIA) for entropy-regularized stochastic control problems on an infinite time horizon with a large discount rate, focusing on two main scenarios. First, we analyze PIA with bounded coefficients where the controls applied to the diffusion term satisfy a smallness condition. We prove the convergence of PIA using a uniform $\mathcal{C}^{2,\alpha}$ estimate for the value sequence generated by PIA, and provide a quantitative convergence analysis for this scenario. Second, we investigate PIA with unbounded coefficients but no control over the diffusion term. In this scenario, we first establish the well-posedness of the exploratory Hamilton--Jacobi--Bellman equation with linear growth coefficients and a polynomial growth reward function. Using this well-posedness result, we obtain the convergence of PIA by establishing quantitative locally uniform $\mathcal{C}^{1,\alpha}$ estimates for the generated value sequence.
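
To make the structure of PIA concrete, the following is a minimal sketch of entropy-regularized policy iteration on a finite-state, finite-action Markov decision process. This is only a discrete analogue chosen for illustration, not the continuous-time PDE setting analyzed in the paper. The function name `soft_policy_iteration`, the discount factor `gamma`, the temperature `lam`, and the array layout of `P` and `r` are all assumptions introduced here. Each iteration alternates a policy evaluation step (a linear solve, standing in for the linear PDE solved at each PIA step) with a policy improvement step (a Gibbs, or softmax, update).

```python
import numpy as np


def soft_policy_iteration(P, r, gamma=0.9, lam=0.5, n_iter=50):
    """Entropy-regularized policy iteration on a finite MDP (illustrative only).

    P: array of shape (S, A, S), P[s, a, t] = probability of moving s -> t under a.
    r: array of shape (S, A), one-step rewards.
    gamma: discount factor in (0, 1); lam: entropy temperature > 0.
    Returns the sequence of value vectors and the final policy.
    """
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)  # start from the uniform policy
    values = []
    for _ in range(n_iter):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi + lam * entropy(pi).
        P_pi = np.einsum("sa,sat->st", pi, P)
        entropy = -np.sum(pi * np.log(np.clip(pi, 1e-300, None)), axis=1)
        r_pi = np.sum(pi * r, axis=1) + lam * entropy
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        values.append(v)

        # Policy improvement: Gibbs policy pi(a|s) proportional to exp(Q(s, a) / lam).
        q = r + gamma * (P @ v)  # shape (S, A)
        z = (q - q.max(axis=1, keepdims=True)) / lam  # shift for numerical stability
        pi = np.exp(z)
        pi /= pi.sum(axis=1, keepdims=True)
    return np.array(values), pi
```

In this discrete setting the value vectors are nondecreasing from one iteration to the next, and their limit satisfies the soft Bellman equation $v(s) = \lambda \log \sum_a \exp\big(Q(s,a)/\lambda\big)$, which plays the role that the exploratory HJB equation plays in the continuous-time problem.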
