Functional Natural Policy Gradients

Aurelien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus

Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
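To make the "cross-fitted debiasing device" concrete, the following Python sketch shows a generic cross-fitted doubly robust (AIPW-style) estimate of a target policy's value in a K-armed contextual bandit from logged data. This illustrates the general cross-fitting-plus-debiasing pattern only, not necessarily the paper's exact estimator; the function name cross_fitted_policy_value, the Ridge outcome model, and the policy_probs interface are assumptions made for this sketch.

import numpy as np
from sklearn.linear_model import Ridge

def cross_fitted_policy_value(X, A, R, prop, policy_probs, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of the target policy's value.

    X: (n, d) contexts; A: (n,) logged arms in {0, ..., K-1}; R: (n,) rewards;
    prop: (n,) behavior propensities pi_b(A_i | X_i);
    policy_probs: callable mapping contexts (m, d) -> (m, K) target-policy probabilities.
    """
    n = X.shape[0]
    K = int(A.max()) + 1
    folds = np.random.default_rng(seed).integers(0, n_folds, size=n)
    psi = np.zeros(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Nuisance: per-arm outcome regressions mu_hat(x, a), fit on out-of-fold data only.
        mu = [Ridge(alpha=1.0).fit(X[train & (A == a)], R[train & (A == a)]) for a in range(K)]
        pi = policy_probs(X[test])                                   # (n_test, K)
        mu_test = np.column_stack([m.predict(X[test]) for m in mu])  # (n_test, K)
        plug_in = (pi * mu_test).sum(axis=1)       # plug-in value under the target policy
        rows = np.arange(test.sum())
        a_test = A[test]
        # Debiasing correction: importance-weighted residual at the logged action.
        w = pi[rows, a_test] / prop[test]
        psi[test] = plug_in + w * (R[test] - mu_test[rows, a_test])
    value = float(psi.mean())
    se = float(psi.std(ddof=1) / np.sqrt(n))       # rough O(1/sqrt(N)) uncertainty scale
    return value, se

# Toy usage on synthetic logged data with a uniform behavior policy over K = 3 arms.
rng = np.random.default_rng(1)
n, d, K = 5000, 5, 3
X = rng.normal(size=(n, d))
A = rng.integers(0, K, size=n)
prop = np.full(n, 1.0 / K)
R = X[:, 0] * (A == 0) + 0.5 * X[:, 1] * (A == 1) + rng.normal(scale=0.1, size=n)
def greedy(Xq):
    scores = np.column_stack([Xq[:, 0], 0.5 * Xq[:, 1], np.zeros(len(Xq))])
    return np.eye(K)[scores.argmax(axis=1)]        # one-hot (deterministic) target policy
print(cross_fitted_policy_value(X, A, R, prop, greedy))

The cross-fitting split is what lets the nuisance fits (here the outcome regressions) come from flexible, non-Donsker model classes while keeping the averaged estimate well behaved, matching the abstract's separation between policy-class complexity and environment-nuisance complexity.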

Paper Structure

This paper contains 8 sections, 1 theorem, 36 equations, 1 algorithm.

Key Result

Theorem 1

Assume $\Pi$ is convex, that $t_1$ is an interior maximizer of the objective, and that $\hat{\pi}_\star$ is defined as in eq:hatpistar. Then the $\sqrt N$ regret bound holds, where, for any $f : \mathcal{X} \times [K] \times [0,1] \to \mathbb{R}$, policy $\bar{\pi}$, and distribution $\bar{P}$ with domain $\mathcal{X} \times [K] \times [0,1]$, $\| f \|_{\bar{\pi}, \bar{P}} := (\bar{P} \{(\bar{\pi} / \pi_b) f^2 \})^{1/2}$.
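For concreteness, here is a small Python sketch of the empirical analogue of this weighted norm, assuming samples drawn from $\bar{P}$ and access to the behavior propensities $\pi_b$; the function name weighted_norm and its argument layout are illustrative, not from the paper.

import numpy as np

def weighted_norm(f_vals, pi_bar_probs, pi_b_probs):
    """Empirical version of || f ||_{pi_bar, P_bar} = (P_bar{ (pi_bar / pi_b) f^2 })^{1/2}.

    f_vals:       (n,) values f(x_i, a_i, r_i) on samples drawn from P_bar
    pi_bar_probs: (n,) probabilities pi_bar(a_i | x_i)
    pi_b_probs:   (n,) behavior propensities pi_b(a_i | x_i)
    """
    # P_bar is replaced by the empirical distribution over the n samples.
    return float(np.sqrt(np.mean((pi_bar_probs / pi_b_probs) * f_vals ** 2)))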

Theorems & Definitions (13)

  • Definition 1: Semiparametric natural policy gradient
  • Remark 1
  • Remark 2
  • Definition 2: Natural policy gradient flow
  • Remark 3
  • Remark 4
  • Theorem 1
  • Remark 5
  • Remark 6
  • Remark 7
  • ...and 3 more