Reusing Trajectories in Policy Gradients Enables Fast Convergence
Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli
TL;DR
The paper addresses the sample inefficiency of on-policy policy gradient methods by introducing RT-PG, which reuses trajectories from the most recent $\omega$ iterations through a novel power mean-corrected multiple importance weighting (MPM) estimator. The estimator blends fresh on-policy data with off-policy trajectories, concentrates with high probability, and yields a sample complexity of $\widetilde{O}(\epsilon^{-2}\omega^{-1})$ for partial reuse and $\widetilde{O}(\epsilon^{-1})$ for full reuse. The analysis combines martingale concentration (Freedman's inequality) with a covering argument to control the target and PM biases, while experiments on MuJoCo environments validate substantial improvements in sample efficiency over strong baselines. The work demonstrates that carefully designed trajectory reuse can attain the best known rates for policy gradient methods while keeping memory and computation requirements practical.
Abstract
Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, which makes them sample-inefficient: they require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $O(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $\omega$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{O}(\epsilon^{-2}\omega^{-1})$. When reusing all available past trajectories, this leads to a rate of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
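To give a rough feel for what combining on-policy and off-policy data through corrected importance weights can look like, the following is a minimal sketch on a toy problem. It is not the estimator defined in the paper. It assumes one-dimensional Gaussian sampling distributions in place of policies, a harmonic power-mean correction of the form $w / ((1-\lambda) + \lambda w)$, and a uniform average across the $\omega$ batches. The names `pm_corrected_weights`, `reuse_estimate`, and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, std=1.0):
    """Density of a 1D Gaussian, standing in for a trajectory density."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def pm_corrected_weights(w, lam):
    """Assumed harmonic power-mean correction: w / ((1 - lam) + lam * w).

    For lam in (0, 1] the corrected weight is bounded above by 1 / lam;
    lam = 0 recovers the plain importance weight.
    """
    return w / ((1.0 - lam) + lam * w)


def reuse_estimate(batches, behavior_means, target_mean, f, lam):
    """Average f under the target using samples from several past distributions.

    Each batch is reweighted by its own corrected importance weight, and the
    per-batch estimates are averaged uniformly (an assumption of this sketch).
    """
    per_batch = []
    for x, mu in zip(batches, behavior_means):
        w = gaussian_pdf(x, target_mean) / gaussian_pdf(x, mu)
        per_batch.append(np.mean(pm_corrected_weights(w, lam) * f(x)))
    return float(np.mean(per_batch))


rng = np.random.default_rng(0)
# Window of omega = 4 sampling distributions; the last one equals the target,
# so its batch plays the role of fresh on-policy data.
behavior_means = [0.0, 0.2, 0.4, 0.5]
target_mean = 0.5
batches = [rng.normal(mu, 1.0, size=2000) for mu in behavior_means]

estimate = reuse_estimate(batches, behavior_means, target_mean, lambda x: x, lam=0.05)
print(f"estimate of E[x] under the target: {estimate:.3f} (true value 0.5)")
```

The correction trades a small bias for bounded weights: in this sketch no corrected weight can exceed `1 / lam`, which is the kind of property that makes concentration arguments tractable when old data is reused.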
