Fisher-Rao Gradient Flows of Linear Programs and State-Action Natural Policy Gradients

Johannes Müller; Semih Çaycı; Guido Montúfar

TL;DR

This paper analyzes Fisher-Rao gradient flows of linear programs, showing linear convergence with a rate governed by the geometry of the feasible region and giving sharper bounds on the entropic regularization error. It then extends the framework to natural gradient flows in parameterized measure spaces, establishing sublinear convergence under inexact gradients and distribution mismatch, as well as conditions for global linear convergence in multi-player games. The results apply directly to state-action natural policy gradients in MDPs, yielding sublinear convergence under general parametrizations and linear convergence for regular tabular policies, and revealing an implicit maximal-entropy bias when optimizers are not unique. Computational examples illustrate the convergence behavior and its relevance for RL algorithms that use entropy-regularized objectives and Fisher-Rao natural gradients.

Abstract

Kakade's natural policy gradient method has been studied extensively in recent years, showing linear convergence with and without regularization. We study another natural gradient method based on the Fisher information matrix of the state-action distributions which has received little attention from the theoretical side. Here, the state-action distributions follow the Fisher-Rao gradient flow inside the state-action polytope with respect to a linear potential. Therefore, we study Fisher-Rao gradient flows of linear programs more generally and show linear convergence with a rate that depends on the geometry of the linear program. Equivalently, this yields an estimate on the error induced by entropic regularization of the linear program which improves existing results. We extend these results and show sublinear convergence for perturbed Fisher-Rao gradient flows and natural gradient flows up to an approximation error. In particular, these general results cover the case of state-action natural policy gradients.
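As an illustration of the flow described in the abstract, the following is a minimal sketch for the simplest feasible region, the probability simplex. This restriction is an assumption made here for brevity; the paper treats general polytopes. On the simplex, the Fisher-Rao gradient flow of the linear potential $\mu \mapsto \langle c, \mu \rangle$ reads $\partial_t \mu_i = \mu_i (c_i - \langle c, \mu \rangle)$ and has the closed form $\mu_t \propto \mu_0 e^{tc}$, which is also the maximizer of $\langle c, \mu \rangle - t^{-1} D_{\operatorname{KL}}(\mu, \mu_0)$ over the simplex. All function names below are hypothetical, and the rate `gap` used in the example is the simplex-specific difference between the largest and second-largest entry of $c$, not the paper's general constant.

```python
import math

def fisher_rao_flow_closed_form(mu0, c, t):
    """Closed-form flow on the simplex: mu_t is proportional to mu0 * exp(t * c)."""
    shift = max(c)  # subtract the maximum before exponentiating for stability
    weights = [m * math.exp(t * (ci - shift)) for m, ci in zip(mu0, c)]
    total = sum(weights)
    return [w / total for w in weights]

def fisher_rao_flow_euler(mu0, c, t, steps=5000):
    """Explicit Euler integration of d mu_i / dt = mu_i * (c_i - <c, mu>)."""
    mu = list(mu0)
    h = t / steps
    for _ in range(steps):
        mean_c = sum(ci * mi for ci, mi in zip(c, mu))
        mu = [mi + h * mi * (ci - mean_c) for ci, mi in zip(c, mu)]
    return mu

def suboptimality(mu, c):
    """Gap between the optimal value max(c) and the current value <c, mu>."""
    return max(c) - sum(ci * mi for ci, mi in zip(c, mu))

if __name__ == "__main__":
    c = [1.0, 0.5, 0.0]          # illustrative cost vector (assumption)
    mu0 = [1 / 3, 1 / 3, 1 / 3]  # uniform initial distribution (assumption)
    gap = 0.5                    # largest minus second-largest entry of c
    for t in (2.0, 4.0, 8.0):
        mu_t = fisher_rao_flow_closed_form(mu0, c, t)
        # Print the time, the suboptimality, and the reference decay exp(-gap * t).
        print(t, suboptimality(mu_t, c), math.exp(-gap * t))
```

In this toy setting the printed suboptimality can be compared with the exponential reference curve, and `fisher_rao_flow_euler` offers a numerical cross-check of the closed form.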

Paper Structure (20 sections, 23 theorems, 113 equations, 4 figures)

Introduction
Contributions
Related works
Notation and terminology
Preliminaries on Fisher-Rao Gradient Flows
Convergence of Fisher-Rao Gradient Flows
Convergence of Fisher-Rao Gradient Flows
Estimating the regularization error
Non-unique maximizers
Convergence of Natural Gradient Flows
Compatible function approximation
Convergence of natural gradient flows
Global convergence for multi-player games
Convergence of State-Action Natural Policy Gradients
Convergence guarantees
...and 5 more sections

Key Result

Proposition 2.2

Consider Setting \ref{setting:generalLP}. Then $\mu_t$ is uniquely characterized by

Figures (4)

Figure 1: Visualization of the suboptimality gap $\Delta$ appearing in \ref{thm:convergenceFRGF} associated to the linear program \ref{eq:LP-inside-simplex}; note that $\Delta$ deteriorates when $c$ is almost orthogonal to a face of $P$.
Figure 2: Transition graph and reward of the MDP example.
Figure 3: Shown are the suboptimality gap $R^\star - R(\theta_t)$ (top row) and the KL-divergence $D_{\operatorname{KL}} (d^\star, d_t)$ (bottom row) for the state-action NPG (left column) and Kakade's NPG (right column) plotted in a logarithmic scale, along with the predicted exponential decay $e^{-\Delta \eta k} = e^{-\Delta_{\operatorname{K}} \eta k}$ (dashed line), see \ref{thm:linearConvergenceTabular} and \cite{khodadadian2022linear} for state-action and Kakade's NPG, respectively.
Figure 4: Shown are the suboptimality gap $R^\star - R(\theta_k)$ (top) and the KL-divergence $D_{\operatorname{KL}} (d^\star, d_{\theta_k})$ (bottom) for the state-action NPG (left) and Kakade's NPG (right); also shown are the guaranteed exponential decay rates $e^{-\Delta \eta k}$ for the state-action NPG (dashed line) and $e^{-\Delta_{\operatorname{K}} \eta k}$ for Kakade's NPG (dotted line). Although the guarantees are different, both methods exhibit the same fast decay rate.

Theorems & Definitions (57)

Proposition 2.2: Central path property \cite{alvarez2004hessian}
proof
Corollary 2.3: Sublinear convergence rate \cite{alvarez2004hessian}
proof
Theorem 2.4: Well-posedness of FR GFs \cite{alvarez2004hessian}
proof
Theorem 3.1: Linear convergence of Fisher-Rao GFs of LPs
Corollary 3.1: Entropic regularization error
Remark 3.2: Comparison with existing results
Example 3.3: Arbitrarily large improvement
...and 47 more
