Dual Approximation Policy Optimization

Zhihan Xiong; Maryam Fazel; Lin Xiao

TL;DR

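Dual Approximation Policy Optimization (DAPO) incorporates general function approximation into policy mirror descent by measuring approximation error with the dual Bregman divergence induced by the mirror map. The resulting framework achieves fast linear convergence with general function approximation and includes several well-known practical methods as special cases, which immediately gives those methods strong convergence guarantees.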

Abstract

We propose Dual Approximation Policy Optimization (DAPO), a framework that incorporates general function approximation into policy mirror descent methods. In contrast to the popular approach of using the $L_2$-norm to measure function approximation errors, DAPO uses the dual Bregman divergence induced by the mirror map for policy projection. This duality framework has both theoretical and practical implications: not only does it achieve fast linear convergence with general function approximation, but it also includes several well-known practical methods as special cases, immediately providing strong convergence guarantees.
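To make the contrast between the $L_2$-norm and a dual Bregman divergence concrete, the sketch below works through one standard instance: the negative entropy on the simplex, whose convex conjugate is the log-sum-exp function. It is an illustration of the divergence only, not the paper's algorithm or loss; the function names and the test vectors are assumptions introduced here. For this mirror map the dual Bregman divergence between two score vectors equals the KL divergence between the corresponding softmax policies, and it is unchanged when a constant is added to the scores, whereas the squared $L_2$ error is not.

```python
import math

def logsumexp(y):
    """Conjugate of negative entropy on the simplex: Phi*(y) = log sum_i exp(y_i)."""
    m = max(y)
    return m + math.log(sum(math.exp(v - m) for v in y))

def softmax(y):
    """Gradient of Phi*; maps dual scores to a probability vector."""
    lse = logsumexp(y)
    return [math.exp(v - lse) for v in y]

def dual_bregman_negentropy(y, y_ref):
    """D_{Phi*}(y, y_ref) = Phi*(y) - Phi*(y_ref) - <grad Phi*(y_ref), y - y_ref>."""
    p_ref = softmax(y_ref)
    inner = sum(p * (a - b) for p, a, b in zip(p_ref, y, y_ref))
    return logsumexp(y) - logsumexp(y_ref) - inner

def kl(p, q):
    """KL(p || q) for two probability vectors with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def half_squared_l2(y, y_ref):
    """The L2-style error used for comparison: 0.5 * ||y - y_ref||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, y_ref))

# Hypothetical score vectors over three actions (illustrative values only).
y = [1.0, 0.0, -1.0]
y_ref = [0.5, 0.5, 0.0]

print("dual Bregman:", dual_bregman_negentropy(y, y_ref))
print("KL(softmax(y_ref) || softmax(y)):", kl(softmax(y_ref), softmax(y)))
print("half squared L2:", half_squared_l2(y, y_ref))
```

Shifting every entry of `y` by the same constant leaves the softmax policy, and hence the dual Bregman divergence, unchanged, while the squared $L_2$ error grows. This is one way to see why the two measures of approximation error can behave differently.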

Paper Structure

This paper contains 39 sections, 19 theorems, 110 equations, 2 figures, 6 tables, 2 algorithms.

Introduction
Preliminaries
Markov Decision Processes
Mirror Descent
Policy Optimization with Dual Function Approximation
Instantiations of DAPO
Comparison with AMPO, MDPO and FMA-PG
SAC as a special case of DAPO-KL
Convergence Analysis
Analysis of DAPO-KL
Analysis of SAC
Experiments
Conclusions
Related Work
PG and PMD in tabular MDPs.
...and 24 more sections

Key Result

Theorem 4.1

Consider Algorithm \ref{alg:compo} with initial policy $\pi^{(0)}$, initial distribution $\rho\in\Delta(\mathcal{S})$ and $\Phi$ being the negative entropy restricted on $\Delta(\mathcal{A})$. Suppose Assumptions (A1), (A2) and (A3) hold and the step sizes satisfy $\eta_0>1$ and $\eta_{k+1}\geq(\ldots)$ [...] where $\psi(x)=(1+C_\rho)\left(x+\sqrt{2x}\right)$ for $x\geq 0$.
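As a small aid for reading the theorem, the function $\psi$ can be evaluated directly. The sketch below assumes only the formula $\psi(x)=(1+C_\rho)\left(x+\sqrt{2x}\right)$ given in the statement; the constant $C_\rho$ is defined elsewhere in the paper and is treated here as a plain input, and the name `psi` and the sample values are illustrative.

```python
import math

def psi(x, c_rho):
    """Evaluate psi(x) = (1 + C_rho) * (x + sqrt(2x)) for x >= 0.

    c_rho stands in for the constant C_rho from the theorem; its value is
    an input here, not something derived in this sketch.
    """
    if x < 0:
        raise ValueError("psi is only defined for x >= 0")
    return (1.0 + c_rho) * (x + math.sqrt(2.0 * x))

# Illustrative inputs only: psi vanishes at 0, and for small x the
# square-root term dominates the linear term.
for x in (0.0, 0.01, 0.5, 2.0):
    print(x, psi(x, c_rho=1.0))
```

Since $\sqrt{2x}$ dominates $x$ near zero, $\psi(x)$ shrinks only like $\sqrt{x}$ as its argument becomes small.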

Figures (2)

Figure 1: Average return curves on MuJoCo benchmarks. Each curve is averaged over 5 random seeds and the shaded area represents the $95\%$ confidence interval. Here $m$ represents the number of stochastic gradient steps in each policy update iteration.
Figure 2: Comparison under $m=1$ and $m=10$ gradient steps per iteration between MAMPO and variants of AMPO-KL. Here, "AMPO-Var-1" refers to Eq. \ref{equ:ampo_var_1} and "AMPO-Var-2" refers to Eq. \ref{equ:ampo_var_2}. Each curve is averaged over 5 different random seeds and the shaded area represents the $95\%$ confidence interval.

Theorems & Definitions (40)

Example 2.1: Squared $L_2$-norm
Example 2.2: Negative entropy on $\mathbb{R}^n_+$
Example 2.3: Negative entropy on $\Delta$
Remark 2.4
Theorem 4.1: Linear Convergence of DAPO-KL
Remark 4.2
Lemma 4.2: Modified Performance Difference Lemma
Theorem 4.3: Sublinear Convergence of SAC
Definition B.1
Lemma B.1
...and 30 more
