Entropy annealing for policy mirror descent in continuous time and space

Deven Sethi; David Šiška; Yufei Zhang

TL;DR

This paper studies the convergence of policy gradient methods for continuous-time, continuous-space exit-time control problems with entropy regularization. It introduces a continuous-time policy mirror descent flow that updates a Gibbs-policy feature $Z$ under a decaying entropy parameter $\tau$, and establishes well-posedness, monotone cost decrease, and explicit convergence rates. For a fixed entropy level, the flow converges exponentially to the $\tau$-regularized optimum. When the entropy level is annealed, the flow converges to the solution of the unregularized problem at rate $O(1/S)$ for discrete action spaces and, for general action spaces, at rate $O((\log S)^\alpha/\sqrt{S})$, i.e. $O(1/\sqrt{S})$ up to logarithmic factors, where $S$ is the running time of the flow. A key insight is that entropy regularization can improve optimization even with the true gradient, because the annealing schedule balances optimization error against regularization bias. The work also discusses extensions to controlled diffusion coefficients and provides a rigorous PDE-based analysis in the infinite-dimensional policy space of Markov kernels.

Abstract

Entropy regularization has been widely used in policy optimization algorithms to enhance exploration and the robustness of the optimal control; however, it also introduces an additional regularization bias. This work quantifies the impact of entropy regularization on the convergence of policy gradient methods for stochastic exit time control problems. We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function and adjusts the strength of entropy regularization as the algorithm progresses. We prove that with a fixed entropy level, the mirror descent dynamics converges exponentially to the optimal solution of the regularized problem. We further show that when the entropy level decays at suitable polynomial rates, the annealed flow converges to the solution of the unregularized problem at a rate of $\mathcal O(1/S)$ for discrete action spaces and, under suitable conditions, at a rate of $\mathcal O(1/\sqrt{S})$ for general action spaces, with $S$ being the gradient flow running time. The technical challenge lies in analyzing the gradient flow in the infinite-dimensional space of Markov kernels for nonconvex objectives. This paper explains how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate.
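To make the fixed-entropy statement concrete, the following is a minimal sketch on a one-state toy problem with three actions, not the exit-time control problem analyzed in the paper. It assumes that, without state dynamics, the mirror descent flow reduces to $\partial_s Z_s = -(c + \tau Z_s)$ for a per-action cost $c$, with the Gibbs policy $\pi(Z) \propto e^{Z}\mu$. The cost vector, reference measure, step size, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def gibbs_policy(z, mu):
    """Return the policy proportional to exp(z) * mu (max-shifted for stability)."""
    weights = mu * np.exp(z - np.max(z))
    return weights / weights.sum()


def run_flow(cost, tau, horizon, dt=1e-2):
    """Explicit Euler steps for the assumed toy flow dZ/ds = -(cost + tau * Z), Z_0 = 0."""
    z = np.zeros_like(cost)
    for _ in range(int(round(horizon / dt))):
        z = z - dt * (cost + tau * z)
    return z


def regularized_cost(pi, cost, mu, tau):
    """Expected cost plus tau times the relative entropy KL(pi | mu)."""
    return float(pi @ cost + tau * np.sum(pi * np.log(pi / mu)))


cost = np.array([0.0, 0.3, 1.0])  # hypothetical per-action costs
mu = np.full(3, 1.0 / 3.0)        # uniform reference measure (assumption)
tau = 0.5                         # fixed entropy level

# Closed-form minimizer of the regularized toy objective: pi* ∝ exp(-cost / tau) * mu.
pi_star = gibbs_policy(-cost / tau, mu)
best = regularized_cost(pi_star, cost, mu, tau)

gaps = []
for horizon in (5.0, 10.0, 20.0):
    pi = gibbs_policy(run_flow(cost, tau, horizon), mu)
    gaps.append(regularized_cost(pi, cost, mu, tau) - best)

for horizon, gap in zip((5.0, 10.0, 20.0), gaps):
    print(f"S={horizon:5.1f}  regularized gap={gap:.3e}")
```

In this toy setting the feature $Z$ relaxes to $-c/\tau$, so the regularized gap shrinks quickly as the running time grows. The sketch only illustrates the qualitative behaviour and does not reproduce the paper's constants or assumptions.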

Paper Structure (18 sections, 31 theorems, 170 equations, 1 figure)

Introduction
Outline of main results
Most related works
Notation
Problem formulation and main results
Relaxed control problem
Well-posedness of the mirror descent flow
Convergence of mirror descent for the regularized problem
Convergence of mirror descent with constant schedulers
Convergence of mirror descent with annealing schedulers
Discussion: controlled diffusion coefficients
Performance difference and regularity of cost functional
Proofs of Theorem \ref{ref:cost_decrease_along_flow}, Proposition \ref{thm:convergence_of_GF} and Theorem \ref{cor:extend_conv_GF}
Proof of Theorem \ref{thm:convergece_tau}
Proofs of Theorems \ref{thm:conv_discrete_anneal} and \ref{thm:conv_general_anneal}
...and 3 more sections

Key Result

Proposition 2.2

Suppose Assumption \ref{ass:data} holds and $\tau>0$. Let $\pi\in \Pi_{\mu}$, and let $v^{\pi}_\tau$ be the associated value function given by \eqref{eq:value}. Then $v^{\pi}_\tau$ satisfies the Dirichlet problem \eqref{eq:on_policy_bellman}, $v^\pi_\tau\in W^{2,p^*}(\mathcal{O})$ with $p^*$ from Assumption \ref{ass:data}, an

Figures (1)

Figure 1: The overall error $v_0^{\boldsymbol{\pi}(Z_S)} - v_0^\ast$ with annealing schedulers $\boldsymbol{\tau}_s = 1/(1+s)^\beta$, for different $\beta\in (0,1)$ and running horizon $S$.
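The annealing scheduler in the caption, $\boldsymbol{\tau}_s = 1/(1+s)^\beta$, can be tried out on a one-state toy problem. The sketch below is an illustration under assumptions and does not reproduce Figure 1: it assumes the flow reduces to $\partial_s Z_s = -(c + \boldsymbol{\tau}_s Z_s)$ when there are no state dynamics, and the cost vector, uniform reference measure, step size, and function names are hypothetical.

```python
import numpy as np


def gibbs_policy(z, mu):
    """Return the policy proportional to exp(z) * mu (max-shifted for stability)."""
    weights = mu * np.exp(z - np.max(z))
    return weights / weights.sum()


def annealed_error(cost, mu, beta, horizon, dt=1e-2):
    """Unregularized suboptimality of pi(Z_S) after running the assumed toy flow
    dZ/ds = -(cost + tau_s * Z) with tau_s = 1 / (1 + s)**beta up to time S = horizon."""
    z = np.zeros_like(cost)
    for k in range(int(round(horizon / dt))):
        tau_s = 1.0 / (1.0 + k * dt) ** beta
        z = z - dt * (cost + tau_s * z)
    pi = gibbs_policy(z, mu)
    return float(pi @ cost - cost.min())


cost = np.array([0.0, 0.3, 1.0])  # hypothetical per-action costs
mu = np.full(3, 1.0 / 3.0)        # uniform reference measure (assumption)
horizons = (10.0, 50.0, 200.0)

errors = {
    beta: [annealed_error(cost, mu, beta, S) for S in horizons]
    for beta in (0.25, 0.5, 0.75)
}

for beta, errs in errors.items():
    print(f"beta={beta}: " + ", ".join(f"{e:.2e}" for e in errs))
```

For each $\beta$, the printed values give the toy analogue of the overall error $v_0^{\boldsymbol{\pi}(Z_S)} - v_0^\ast$ at increasing running horizons $S$. The toy problem is too simple to exhibit the trade-off between optimization error and regularization bias that the paper quantifies, so the output should not be read as evidence for any particular rate.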

Theorems & Definitions (55)

Proposition 2.2
Theorem 2.3
Theorem 2.4
Proposition 2.5
Proposition 2.6
Theorem 2.7
Corollary 2.8
Theorem 2.10
Theorem 2.11
Theorem 2.13
...and 45 more
