Policy Iteration Achieves Regularized Equilibrium under Time Inconsistency

Yu-Jui Huang; Xiang Yu; Keyu Zhang

Abstract

For a general entropy-regularized time-inconsistent stochastic control problem, we design a policy iteration algorithm (PIA) and establish its convergence to an equilibrium policy at an exponential rate. The design of the PIA is based on a coupled system of non-local partial differential equations, called the exploratory equilibrium Hamilton--Jacobi--Bellman (EEHJB) equation. In contrast to the standard time-consistent case, policy improvement fails in general, and the target value function (now an equilibrium value function) is not even known a priori to exist. To overcome these difficulties, we use the Bismut--Elworthy--Li formula for stochastic representation to prove that the value functions generated by the PIA form a Cauchy sequence in a specialized Banach space, and hence admit a limit, with an exponential rate of convergence. The limiting value function is then shown to satisfy the EEHJB equation, and thus yields an equilibrium policy in Gibbs form. This convergence in value implies uniform convergence of the generated policies to the equilibrium policy, again at an exponential rate. As a byproduct, the PIA gives a constructive proof of the global existence and uniqueness of a classical solution to our general EEHJB equation, whose well-posedness has not been explored in the literature.
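To make the terms "policy iteration" and "Gibbs form" concrete, the following is a minimal sketch on a toy problem, not the PIA of the paper. It assumes a discrete-time, finite-horizon, finite-state, time-consistent setting, in which policy improvement does hold; the paper's setting is time-inconsistent and is characterized through the non-local EEHJB system. All names (`gibbs`, `evaluate`, `improve`, `policy_iteration`, the temperature `lam`, and the toy data `P`, `r`) are hypothetical and chosen for illustration only.

```python
import math

def gibbs(q_row, lam):
    """Gibbs (softmax) distribution: weights proportional to exp(q / lam)."""
    m = max(q_row)  # shift by the max for numerical stability
    w = [math.exp((q - m) / lam) for q in q_row]
    total = sum(w)
    return [x / total for x in w]

def q_values(P, r, v_next):
    """Q[s][a] = r[s][a] + sum over next states t of P[s][a][t] * v_next[t]."""
    return [[r[s][a] + sum(p * v for p, v in zip(P[s][a], v_next))
             for a in range(len(r[s]))] for s in range(len(r))]

def evaluate(pi, P, r, lam):
    """Backward recursion for the entropy-regularized value of policy pi."""
    horizon, n_s = len(pi), len(r)
    V = [[0.0] * n_s for _ in range(horizon + 1)]  # terminal value is zero
    for t in reversed(range(horizon)):
        Q = q_values(P, r, V[t + 1])
        for s in range(n_s):
            # expected reward-to-go plus lam times the entropy of pi[t][s]
            V[t][s] = sum(p * (q - lam * math.log(p))
                          for p, q in zip(pi[t][s], Q[s]))
    return V

def improve(V, P, r, lam):
    """Policy update: Gibbs policy built from the current value function."""
    horizon = len(V) - 1
    return [[gibbs(q_row, lam) for q_row in q_values(P, r, V[t + 1])]
            for t in range(horizon)]

def policy_iteration(P, r, horizon, lam=0.5, iters=10):
    """Alternate evaluation and Gibbs updates; record sup-norm value gaps."""
    n_s, n_a = len(r), len(r[0])
    # start from the uniform policy at every time and state
    pi = [[[1.0 / n_a] * n_a for _ in range(n_s)] for _ in range(horizon)]
    V = evaluate(pi, P, r, lam)
    gaps = []
    for _ in range(iters):
        pi = improve(V, P, r, lam)
        V_new = evaluate(pi, P, r, lam)
        gaps.append(max(abs(a - b)
                        for row_new, row_old in zip(V_new, V)
                        for a, b in zip(row_new, row_old)))
        V = V_new
    return pi, V, gaps

# Toy data (assumed): 2 states, 2 actions, P[s][a] is a distribution over next states.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[1.0, 0.0],
     [0.0, 2.0]]
pi, V, gaps = policy_iteration(P, r, horizon=3, lam=0.5, iters=6)
print(gaps)
```

The recorded `gaps` play the role of the distances between successive value functions. In this toy setting they vanish after finitely many iterations by a backward-induction argument, whereas the paper obtains an exponential rate in a specialized Banach space by a different argument.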

Paper Structure (7 sections, 128 equations, 3 figures)

Introduction
Notations
Problem Formulation
Derivation of the EEHJB Equation
Policy Iteration Algorithm
The Convergence of the PIA
Numerical Examples

Figures (3)

Figure 1: Convergence of the policy sequence and value function under utility function (i).
Figure 2: Convergence of the policy sequence and value function under utility function (ii).
Figure 3: Convergence of the policy sequence and value function under utility function (iii).
