Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies

Xiang Li; Nan Jiang; Yuheng Zhang

TL;DR

When extending mirror descent to parameterized policies, this work identifies contextual coupling as the core difficulty, and shows how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.

Abstract

We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
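
To make "state-wise mirror descent" concrete, the following is a minimal sketch, assuming a tabular setting with a small finite action set and the standard exponentiated update $\pi_{k+1}(a|s)\propto\pi_k(a|s)\exp(\eta f_k(s,a))$. The function name, array shapes, and toy critic values are illustrative assumptions, not taken from the paper. It shows that the actor is induced entirely by the critic, one state at a time, which is the structure the abstract says breaks down for standalone parameterized policies.

```python
import numpy as np

def statewise_mirror_descent_step(pi, f, eta):
    """One KL-regularized mirror descent step, applied independently per state.

    pi  : (S, A) array, each row a strictly positive action distribution
    f   : (S, A) array of critic values f_k(s, a) (hypothetical inputs)
    eta : positive step size
    """
    logits = np.log(pi) + eta * f
    # Subtract the per-state maximum for numerical stability; this does not
    # change the normalized result.
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy example: 2 states, 3 actions, uniform initial policy.
pi = np.full((2, 3), 1.0 / 3.0)
f = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
pi_next = statewise_mirror_descent_step(pi, f, eta=0.5)
```

Because each row is updated on its own, nothing couples the update at one state to the update at another; a shared parameter vector across states removes exactly this independence.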

Paper Structure (52 sections, 30 theorems, 231 equations, 1 figure, 4 algorithms)

Introduction
Preliminary
Notation.
Markov Decision Processes.
Policy Optimization.
Offline RL.
Pessimistic Soft Policy Iteration as State-wise Mirror Descent
Review of PSPI Algorithm
Mirror Descent with General Action Space
Contextual Coupling in Parameterized Policy Optimization
Why Contextual Mirror Descent Breaks
Regret Decomposition via Compatible Function Approximation
Constructing Unified Policy Updates in Parameter Space
Least Square Policy Update
Error transfer through the coverage condition.
...and 37 more sections

Key Result

Theorem 1

Under Assumptions ass:action and ass:continuous-policy, PSPI (Algorithm alg:pspi) with step size $\eta=\sqrt{8D_\textup{KL}(\pi_\textup{cp}\|\pi_1)/(KV_\textup{max}^2)}$ achieves where $D_\textup{KL}(\pi_\textup{cp}\|\pi)=\mathbb{E}_{s\sim d^{\pi_\textup{cp}}}[D_\textup{KL}(\pi_\textup{cp}(\cdot|s)\|\pi(\cdot|s))]$ is taken under $d^{\pi_\textup{cp}}$ by default, unless otherwise specified.
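
As a small worked example of the step size in Theorem 1, the sketch below evaluates $\eta=\sqrt{8D_\textup{KL}(\pi_\textup{cp}\|\pi_1)/(KV_\textup{max}^2)}$ for given inputs. The function name and the numeric values are hypothetical; in particular, the KL divergence to the comparator policy is generally unknown in practice, so this only illustrates how the quantities scale.

```python
import math

def pspi_step_size(kl_cp_init, num_iters, v_max):
    """Evaluate eta = sqrt(8 * D_KL(pi_cp || pi_1) / (K * V_max^2)).

    kl_cp_init : D_KL(pi_cp || pi_1), assumed known here for illustration
    num_iters  : number of iterations K
    v_max      : value range bound V_max
    """
    if kl_cp_init < 0 or num_iters <= 0 or v_max <= 0:
        raise ValueError("expected kl_cp_init >= 0, num_iters > 0, v_max > 0")
    return math.sqrt(8.0 * kl_cp_init / (num_iters * v_max ** 2))

# Illustrative values only: D_KL = 2, K = 100, V_max = 1 gives eta close to 0.4.
eta = pspi_step_size(kl_cp_init=2.0, num_iters=100, v_max=1.0)
```

The step size shrinks as $1/\sqrt{K}$ and as $1/V_\textup{max}$, and grows with the square root of the initial KL divergence to the comparator.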

Figures (1)

Figure 1: Comparison between LSPU and DRPU under the no-shift setting ($d^D=d^{\pi_\textup{cp}}$). Left: Performance $J(\pi_k)$ over iterations, where DRPU converges to the comparator policy $\pi_\textup{cp}$ (which is not optimal), while LSPU plateaus at a worse policy. Right: The CFA error, $\textup{err}_k$, at iteration $k=80$ on a log scale, showing that DRPU drives the error close to zero, whereas LSPU incurs a non-vanishing error.

Theorems & Definitions (50)

Theorem 1: Regret Bound of Algorithm alg:pspi in General Action Space
Proposition 2: Failure for Contextual Mirror Descent
Lemma 3: Regret Decomposition Lemma
Theorem 4: Main Theorem for LSPU
Theorem 5: Main Theorem for DRPU under $\mathcal{W}_\infty$ Class
Proof: Proof of Theorem thm:continuous-pspi
Theorem 6: Unified KL Bound with Convex Action Space for Theorem thm:continuous-pspi
Proof: Proof of Theorem thm:continuous-pspi-unified
Proof: Proof of Proposition thm:hardness
Proposition 7: No actor-critic incompatibility in the hardness construction
...and 40 more
