Policy Gradient with Active Importance Sampling

Matteo Papini; Giorgio Manganini; Alberto Maria Metelli; Marcello Restelli

TL;DR

The paper addresses high-variance off-policy policy gradient estimation by introducing Behavioral Policy Optimization (BPO), an active importance sampling (IS) approach that learns the behavioral policy so as to minimize the variance of the gradient estimator. It proposes a two-phase algorithm: (i) cross-entropy-based learning of a minimum-variance behavioral policy and (ii) off-policy policy gradient updates using defensive IS. The treatment is both theoretical (variance and convergence rate) and practical (sample reuse). A key result is a convergence rate of $O(\epsilon^{-4})$ to a stationary point under KL-control and exponential-family policy assumptions, with a potential improvement to $O(\epsilon^{-10/3})$ when the residual variance is small. Empirical validation on LQ and Cartpole shows reduced gradient-estimation variance and faster learning, supporting the practicality of active IS for policy optimization. The work advances variance-reduction techniques for policy gradients and sets the stage for extensions with SVRPG and deep architectures.

Abstract

Importance sampling (IS) represents a fundamental technique for a large surge of off-policy reinforcement learning (RL) approaches. Policy gradient (PG) methods, in particular, significantly benefit from IS, which enables the effective reuse of previously collected samples and thus increases sample efficiency. Classically, however, IS is employed in RL as a passive tool for re-weighting historical samples. The statistical community, by contrast, employs IS as an active tool, combined with behavioral distributions that reduce the variance of the estimate even below that of the sample mean. In this paper, we focus on this second setting by addressing the behavioral policy optimization (BPO) problem. We look for the best behavioral policy from which to collect samples, so as to reduce the policy gradient variance as much as possible. We provide an iterative algorithm that alternates between the cross-entropy estimation of the minimum-variance behavioral policy and the actual policy optimization, leveraging defensive IS. We theoretically analyze this algorithm, showing that it enjoys a convergence rate of order $O(\epsilon^{-4})$ to a stationary point, but depending on a more convenient variance term w.r.t. standard PG methods. We then provide a practical version that is numerically validated, showing the advantages in the policy gradient estimation variance and in the learning speed.
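To make the idea of active, defensive IS concrete, here is a minimal sketch on a toy one-dimensional problem. It is not the paper's algorithm: the Gaussian target and behavioral distributions, the rare-event integrand, the mixture coefficient `beta`, and the helper names `gaussian_pdf` and `defensive_is_estimate` are all assumptions chosen for illustration. Samples are drawn from the mixture `beta * target + (1 - beta) * behavioral`, which keeps every importance weight below `1 / beta`.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (hypothetical helper for this toy example)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))


def defensive_is_estimate(f, target, behavioral, beta, n, rng):
    """Estimate E_target[f(X)] with samples from a defensive mixture.

    target and behavioral are (mean, std) pairs of 1-D Gaussians.
    The sampling distribution is beta * target + (1 - beta) * behavioral.
    """
    t_mean, t_std = target
    b_mean, b_std = behavioral
    # Choose, for each sample, which mixture component generates it.
    from_target = rng.random(n) < beta
    x = np.where(
        from_target,
        rng.normal(t_mean, t_std, n),
        rng.normal(b_mean, b_std, n),
    )
    p = gaussian_pdf(x, t_mean, t_std)
    q = gaussian_pdf(x, b_mean, b_std)
    # Importance weights w.r.t. the mixture; bounded above by 1 / beta.
    weights = p / (beta * p + (1.0 - beta) * q)
    return np.mean(weights * f(x)), weights


rng = np.random.default_rng(0)


def rare_event(x):
    """Indicator of X > 2, a rare event under the N(0, 1) target."""
    return (x > 2.0).astype(float)


# Behavioral distribution shifted toward the region where the integrand is nonzero.
estimate, weights = defensive_is_estimate(
    rare_event, target=(0.0, 1.0), behavioral=(2.5, 1.0), beta=0.2, n=20_000, rng=rng
)
```

In this toy setting the behavioral distribution is fixed by hand; in the paper's setting it is a policy learned by cross-entropy estimation, and the quantity being estimated is the policy gradient rather than a scalar probability.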

Paper Structure (23 sections, 22 theorems, 94 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Preliminaries
Behavioral Policy Optimization
Closed-form solution
Cross-entropy minimization
Theoretical Analysis
Behavior Policy Optimization Oracle
Convergence Rate
Related Works
Numerical Simulations
Practical Algorithm
Experimental Results
Discussion and Conclusions
Hellinger Distance
Omitted Proofs
...and 8 more sections

Key Result

Theorem 1

Let $\bm{\theta} \in \bm{\Theta}$ and let $\mathbf{g}_{\bm{\theta}} : \bm{\mathcal{T}} \rightarrow \mathbb{R}^d$ be the single-trajectory gradient estimator used to compute $\widehat{\nabla} J(\bm{\theta}; \bm{\tau})$. The solution $p_{*,\bm{\theta}} \in \Delta^{\bm{\mathcal{T}}}$ to the BPO problem [...] The optimal value of Equation (eq:opt) is given by:

Figures (1)

Figure 1: Cartpole. Average return and its 95% Gaussian CI (30 repetitions) over the learning iterations. Different policy gradient batch-sizes were used: (a) $N_{\mathrm{PG}} = 5$, (b) $N_{\mathrm{PG}} = 10$, (c) $N_{\mathrm{PG}} = 20$, (d) $N_{\mathrm{PG}} = 50$, (e) $N_{\mathrm{PG}} = 100$.

Theorems & Definitions (37)

Theorem 1
Proposition 1
Lemma 1
Theorem 2
Remark 4.1
Theorem 3
Lemma 2
Theorem 4
Corollary 1
Remark 4.2
...and 27 more
