Dual Approximation Policy Optimization

Zhihan Xiong, Maryam Fazel, Lin Xiao

TL;DR

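Dual Approximation Policy Optimization (DAPO) measures function approximation error with the dual Bregman divergence induced by the mirror map. This duality framework has both theoretical and practical implications: not only does it achieve fast linear convergence with general function approximation, but it also includes several well-known practical methods as special cases, immediately providing strong convergence guarantees.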

Abstract

We propose Dual Approximation Policy Optimization (DAPO), a framework that incorporates general function approximation into policy mirror descent methods. In contrast to the popular approach of using the $L_2$-norm to measure function approximation errors, DAPO uses the dual Bregman divergence induced by the mirror map for policy projection. This duality framework has both theoretical and practical implications: not only does it achieve fast linear convergence with general function approximation, but it also includes several well-known practical methods as special cases, immediately providing strong convergence guarantees.
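To make the contrast between an $L_2$-type error and a Bregman-type error concrete, the following is a minimal sketch, not the paper's projection objective (which this summary does not state). It evaluates the generic Bregman divergence $D_\Phi(x,y)=\Phi(x)-\Phi(y)-\langle\nabla\Phi(y),x-y\rangle$ for the squared $L_2$-norm and for negative entropy, and then for log-sum-exp, the convex conjugate of negative entropy restricted to the simplex. All function names and test vectors here are illustrative assumptions.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Generic Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def kl(p, q):
    """KL divergence between two probability vectors with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Mirror map: squared L2 norm, Phi(x) = 0.5 * ||x||^2.
def sq_l2(x):
    return 0.5 * sum(v * v for v in x)

def sq_l2_grad(x):
    return list(x)

# Mirror map: negative entropy, Phi(x) = sum_i x_i log x_i.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

# Conjugate of negative entropy on the simplex: log-sum-exp, with gradient softmax.
def logsumexp(u):
    m = max(u)
    return m + math.log(sum(math.exp(v - m) for v in u))

def softmax(u):
    lse = logsumexp(u)
    return [math.exp(v - lse) for v in u]

# Two illustrative policies over three actions (hypothetical numbers).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

d_l2 = bregman(sq_l2, sq_l2_grad, p, q)               # equals 0.5 * ||p - q||^2
d_ent = bregman(neg_entropy, neg_entropy_grad, p, q)  # equals KL(p || q) on the simplex

# Two illustrative dual vectors (e.g. logits); again hypothetical numbers.
u = [1.0, 0.0, -2.0]
v = [0.5, 0.5, 0.0]

# Divergence measured in the dual space, equals KL(softmax(v) || softmax(u)).
d_dual = bregman(logsumexp, softmax, u, v)

print(d_l2, d_ent, d_dual)
```

In this sketch the dual-space divergence between two logit vectors coincides with a KL divergence between the corresponding softmax policies, with the arguments swapped. This is the standard identity relating a Bregman divergence to that of its conjugate.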

Paper Structure

This paper contains 39 sections, 19 theorems, 110 equations, 2 figures, 6 tables, 2 algorithms.

Key Result

Theorem 4.1

Consider Algorithm alg:compo with initial policy $\pi^{(0)}$, initial distribution $\rho\in\Delta(\mathcal{S})$ and $\Phi$ being the negative entropy restricted to $\Delta(\mathcal{A})$. Suppose Assumptions (A1), (A2) and (A3) hold and the step sizes satisfy $\eta_0>1$ and $\eta_{k+1}\geq\dots$ where $\psi(x)=(1+C_\rho)\left(x+\sqrt{2x}\right)$ for $x\geq 0$.
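For intuition about the setting of this theorem, the sketch below shows the exact tabular policy mirror descent step under the negative-entropy mirror map at a single state, where the update has the closed form $\pi^{(k+1)}(a)\propto\pi^{(k)}(a)\exp(\eta_k Q(a))$. It is not the DAPO algorithm with function approximation. The fixed action values, the initial step size, and the doubling schedule are hypothetical choices for illustration only, since the step-size condition is not fully recoverable from the statement above.

```python
import math

def pmd_kl_step(pi_s, q_s, eta):
    """One tabular mirror descent step at a single state with the negative-entropy
    mirror map: the new policy is proportional to pi * exp(eta * Q)."""
    logits = [math.log(p) + eta * q for p, q in zip(pi_s, q_s)]
    m = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical single-state problem with three actions and fixed action values.
pi = [1.0 / 3.0] * 3
q_values = [1.0, 0.0, -1.0]

eta = 1.5                                 # assumed initial step size with eta_0 > 1
history = [pi]
for k in range(5):
    pi = pmd_kl_step(pi, q_values, eta)
    history.append(pi)
    eta *= 2.0                            # hypothetical geometric growth, not the paper's schedule

for k, policy in enumerate(history):
    print(k, [round(p, 6) for p in policy])
```

With fixed action values the probability of the best action increases at every step, and a growing step size concentrates the policy on that action more quickly.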

Figures (2)

  • Figure 1: Average return curves on MuJoCo benchmarks. Each curve is averaged over 5 random seeds and the shaded area represents the $95\%$ confidence interval. Here $m$ represents the number of stochastic gradient steps in each policy update iteration.
  • Figure 2: Comparison under $m=1$ and $m=10$ gradient steps per iteration between MAMPO and variants of AMPO-KL. Here, "AMPO-Var-1" refers to Eq. \ref{equ:ampo_var_1} and "AMPO-Var-2" refers to Eq. \ref{equ:ampo_var_2}. Each curve is averaged over 5 different random seeds and the shaded area represents the $95\%$ confidence interval.

Theorems & Definitions (40)

  • Example 2.1: Squared $L_2$-norm
  • Example 2.2: Negative entropy on $\mathbb{R}^n_+$
  • Example 2.3: Negative entropy on $\Delta$
  • Remark 2.4
  • Theorem 4.1: Linear Convergence of -KL
  • Remark 4.2
  • Lemma 4.2: Modified Performance Difference Lemma
  • Theorem 4.3: Sublinear Convergence of SAC
  • Definition B.1
  • Lemma B.1
  • ...and 30 more