A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation
Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil
TL;DR
This work addresses data efficiency for reinforcement learning by proposing Multi-Fidelity Policy Gradients (MFPG), which blends scarce high-fidelity data with abundant low-fidelity data through a control variate to create an unbiased, variance-reduced estimator for on-policy policy gradients. The method instantiates MFPG as a multi-fidelity REINFORCE algorithm that samples correlated trajectories across fidelities via a reparameterization-based approach and computes an optimal control variate coefficient to minimize variance. The authors provide theoretical convergence guarantees to a first-order stationary point in the high-fidelity environment and establish faster finite-sample rates when cross-fidelity correlation is nonzero. Empirically, MFPG yields robust variance reduction and improved median performance across mild-to-moderate dynamics gaps on MuJoCo robotics tasks, while remaining robust under large dynamics gaps and reward misspecification, highlighting its potential for efficient sim-to-real transfer and cost-effective data collection.
Abstract
Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators (such as reduced-order models, heuristic rewards, or generative world models) can cheaply provide useful data for RL training, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPG), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a multi-fidelity variant of the classical REINFORCE algorithm. We show that, under standard assumptions, the MFPG estimator guarantees asymptotic convergence of REINFORCE to locally optimal policies in the target environment and achieves faster finite-sample convergence rates than training with high-fidelity data alone. Empirically, we evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. With mild-to-moderate dynamics gaps, MFPG reliably improves the median performance over a high-fidelity-only baseline, matching the performance of leading multi-fidelity baselines despite its simplicity and minimal tuning overhead. Under large dynamics gaps, MFPG demonstrates the strongest robustness among the evaluated multi-fidelity approaches. An additional experiment shows that MFPG can remain effective even under low-fidelity reward misspecification. Thus, MFPG not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
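The control-variate idea behind the estimator can be illustrated on a scalar toy problem. The sketch below is not the MFPG algorithm itself: it estimates a mean rather than a policy gradient, and the Gaussian "high-fidelity" and "low-fidelity" models, the sample sizes, and the names `control_variate_mean` and `run_trial` are all assumptions made for illustration. It pairs a few high-fidelity samples with low-fidelity samples that share the same noise (loosely analogous to sampling correlated trajectories across fidelities), and uses many extra cheap low-fidelity samples to estimate the low-fidelity mean.

```python
import numpy as np


def control_variate_mean(x_hi, y_lo_paired, y_lo_mean):
    """Control-variate estimate of E[X].

    x_hi:        few expensive samples of X
    y_lo_paired: cheap samples of Y drawn with the same noise as x_hi
    y_lo_mean:   accurate estimate of E[Y] from many cheap samples
    """
    x_hi = np.asarray(x_hi, dtype=float)
    y = np.asarray(y_lo_paired, dtype=float)
    cov = np.cov(x_hi, y, ddof=1)  # 2x2 sample covariance matrix
    var_y = cov[1, 1]
    # Variance-minimizing coefficient c* = Cov(X, Y) / Var(Y).
    # Estimating it from the same samples is a common practical shortcut;
    # the estimator is exactly unbiased only for a fixed coefficient.
    c = cov[0, 1] / var_y if var_y > 0 else 0.0
    return x_hi.mean() + c * (y_lo_mean - y.mean()), c


def run_trial(rng, n_hi=10, n_lo=5000):
    # Shared noise z makes the two fidelities correlated.
    z = rng.standard_normal(n_hi)
    x_hi = 1.0 + z + 0.3 * rng.standard_normal(n_hi)  # true mean is 1.0
    y_paired = 0.5 + 0.8 * z                          # biased, cheap model
    # Many independent low-fidelity samples estimate E[Y].
    y_extra = 0.5 + 0.8 * rng.standard_normal(n_lo)
    cv_est, _ = control_variate_mean(x_hi, y_paired, y_extra.mean())
    return x_hi.mean(), cv_est


rng = np.random.default_rng(0)
results = np.array([run_trial(rng) for _ in range(2000)])
plain, cv = results[:, 0], results[:, 1]
print("variance, high-fidelity only:", plain.var())
print("variance, control variate:   ", cv.var())
```

Note that the low-fidelity model here has the wrong mean (0.5 instead of 1.0), yet the corrected estimate still centers on the high-fidelity mean, because the low-fidelity term enters only as a zero-mean correction. The variance reduction depends on how strongly the two fidelities are correlated.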
