A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

TL;DR

This work addresses data efficiency for reinforcement learning by proposing Multi-Fidelity Policy Gradients (MFPG), which blends scarce high-fidelity data with abundant low-fidelity data through a control variate to create an unbiased, variance-reduced estimator for on-policy policy gradients. The method instantiates MFPG as a multi-fidelity REINFORCE algorithm that samples correlated trajectories across fidelities via a reparameterization-based approach and computes an optimal control variate coefficient to minimize variance. The authors provide theoretical convergence guarantees to a first-order stationary point in the high-fidelity environment and establish faster finite-sample rates when cross-fidelity correlation is nonzero. Empirically, MFPG yields robust variance reduction and improved median performance across mild-to-moderate dynamics gaps on MuJoCo robotics tasks, while remaining robust under large dynamics gaps and reward misspecification, highlighting its potential for efficient sim-to-real transfer and cost-effective data collection.
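
The blending step described above can be illustrated on a toy scalar problem. The sketch below is not the authors' implementation: the function names, the Gaussian toy model, and the use of shared noise as a stand-in for reparameterization-based correlated sampling are all assumptions made for illustration. It uses the textbook control variate form, with the coefficient estimated from the paired samples.

```python
import numpy as np

def mf_control_variate_estimate(h, l_paired, l_extra):
    """Textbook control variate estimate of E[h] (hypothetical helper).

    h        : scarce high-fidelity samples
    l_paired : low-fidelity samples drawn with the same noise as h
    l_extra  : large independent low-fidelity batch used to estimate E[l]
    """
    h = np.asarray(h, dtype=float)
    l_paired = np.asarray(l_paired, dtype=float)
    l_extra = np.asarray(l_extra, dtype=float)
    var_l = np.var(l_paired, ddof=1)
    cov_hl = np.cov(h, l_paired, ddof=1)[0, 1]
    # Variance-minimizing coefficient for this sign convention.
    c = cov_hl / var_l if var_l > 0 else 0.0
    # The correction term has zero mean, so the low-fidelity bias cancels.
    estimate = h.mean() + c * (l_extra.mean() - l_paired.mean())
    return estimate, c

def toy_trial(rng, n_high=10, n_low=1000):
    # Assumed toy model: shared noise makes the two fidelities correlated.
    eps = rng.standard_normal(n_high)
    h = 1.0 + eps + 0.3 * rng.standard_normal(n_high)  # true mean 1.0
    l_paired = 0.5 + eps                               # biased low fidelity
    l_extra = 0.5 + rng.standard_normal(n_low)
    estimate, _ = mf_control_variate_estimate(h, l_paired, l_extra)
    return h.mean(), estimate

rng = np.random.default_rng(0)
results = np.array([toy_trial(rng) for _ in range(2000)])
hf_only, mf = results[:, 0], results[:, 1]
print("high-fidelity only: mean", hf_only.mean(), "variance", hf_only.var())
print("multi-fidelity:     mean", mf.mean(), "variance", mf.var())
```

In this toy setting both estimators target the high-fidelity mean even though the low-fidelity samples are biased, and the multi-fidelity estimate has the smaller spread across trials because the paired samples share most of their noise.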

Abstract

Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators -- such as reduced-order models, heuristic rewards, or generative world models -- can cheaply provide useful data for RL training, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. Empirically, we evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. With mild-moderate dynamics gaps, MFPG reliably improves the median performance over a high-fidelity-only baseline, matching the performance of leading multi-fidelity baselines despite its simplicity and minimal tuning overhead. Under large dynamics gaps, MFPG demonstrates the strongest robustness among the evaluated multi-fidelity approaches. An additional experiment shows that MFPG can remain effective even under low-fidelity reward misspecification. Thus, MFPG not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
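
The claim that a control variate gives an unbiased, lower-variance estimate can be checked in the scalar case with the textbook identity below. The notation is an assumption introduced here, not the paper's: $X^{h}$ and $X^{l}$ are correlated high- and low-fidelity quantities, $\mu^{l} = \mathbb{E}[X^{l}]$ is treated as known, and $c$ is a fixed coefficient. The paper's estimator is vector-valued and estimates the low-fidelity mean from a large batch, so this is only a simplified sketch of why nonzero cross-fidelity correlation helps.

```latex
\begin{align*}
Z(c) &= X^{h} + c\,\bigl(\mu^{l} - X^{l}\bigr),
  \qquad \mathbb{E}[Z(c)] = \mathbb{E}[X^{h}] \ \text{for every fixed } c, \\
\mathrm{Var}[Z(c)] &= \mathrm{Var}[X^{h}]
  - 2c\,\mathrm{Cov}[X^{h}, X^{l}] + c^{2}\,\mathrm{Var}[X^{l}], \\
c^{\star} &= \frac{\mathrm{Cov}[X^{h}, X^{l}]}{\mathrm{Var}[X^{l}]}, \\
\mathrm{Var}[Z(c^{\star})] &= \bigl(1 - \rho^{2}\bigr)\,\mathrm{Var}[X^{h}],
  \qquad \rho = \frac{\mathrm{Cov}[X^{h}, X^{l}]}
  {\sqrt{\mathrm{Var}[X^{h}]\,\mathrm{Var}[X^{l}]}}.
\end{align*}
```

Under these assumptions the variance never exceeds that of $X^{h}$ alone, and it is strictly smaller whenever $\rho \neq 0$.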

Paper Structure

This paper contains 18 sections, 3 theorems, 32 equations, 15 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Under the assumptions of bounded reward, differentiable policy, bounded score functions, Lipschitz score functions, high-fidelity policy gradient estimate, and correlated fidelities, let $\{\theta_k^{MFPG}\}_{k\in\mathbb{Z}^+}$ and $\{\theta_k^h\}_{k\in\mathbb{ ... where $L_T$ is the Lipschitz constant of the high-fidelity policy gradient, established in lemma: p...

Figures (15)

  • Figure 1: The proposed MFPG framework. At each policy update step, MFPG combines a small amount of data from the target (high-fidelity) environment with a large volume of low-fidelity simulation data, thereby forming an unbiased, reduced-variance estimator for the policy gradient.
  • Figure 2: Variance of policy gradient estimates for MFPG versus variants of the High-Fidelity Only baseline on Hopper-v3 (top), and the percentage of MFPG variance relative to single-fidelity counterparts of the same batch size (bottom). In high-fidelity data-scarce regimes with mild dynamics gaps, MFPG generally exhibits far lower policy gradient variance than the single-fidelity counterparts.
  • Figure 3: Final evaluation returns of MFPG versus baselines in high-fidelity data-scarce regimes. On Hopper-v3 and Walker2d-v3, under mild-to-moderate multi-fidelity dynamics gaps, MFPG reliably improves median performance over the High-Fidelity Only baseline and matches that of leading multi-fidelity methods, despite its simplicity and minimal tuning overhead. Under large dynamics gaps, all multi-fidelity approaches converge toward the High-Fidelity Only baseline performance. On the more challenging HalfCheetah-v3, training stability decreases; here we additionally report the AUC metric (Figure 4) to capture accumulated performance, as it may diverge from final return trends. In this task, MFPG ranks among the best-performing methods overall and demonstrates the strongest robustness (cf. Figure 4).
  • Figure 4: AUC of MFPG versus baselines in high-fidelity data-scarce regimes on HalfCheetah-v3 with friction variations. Under an extreme dynamics gap (5× friction), we additionally report evaluation return curves for all methods: MFPG is the only multi-fidelity approach that remains comparable to High-Fidelity Only in accumulated performance (AUC), while all multi-fidelity methods suffer severe degradation in terms of final return (cf. Figure 3).
  • Figure 5: Evaluation return curves of MFPG versus the High-Fidelity Only baseline on Hopper-v3, corresponding to Figure 3, under mild, moderate, and large gravity variations (left to right: 0.8×, 2.0×, 5.0×), together with estimated Pearson correlation coefficients between high- and low-fidelity policy gradient losses. MFPG performance increases as the cross-fidelity correlation strengthens.
  • ...and 10 more figures
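
The Figure 5 caption reports Pearson correlation coefficients between high- and low-fidelity policy gradient losses. A minimal sketch of such a diagnostic follows; the function name, the synthetic loss arrays, and the `gap` noise scale are hypothetical and are not taken from the paper's code. The `1 - rho**2` value is the best-case variance ratio from the textbook scalar control variate identity with a known low-fidelity mean, not a quantity the paper is confirmed to report.

```python
import numpy as np

def pearson_and_variance_ratio(loss_high, loss_low):
    """Return (rho, 1 - rho^2) for paired per-sample losses (hypothetical helper)."""
    loss_high = np.asarray(loss_high, dtype=float)
    loss_low = np.asarray(loss_low, dtype=float)
    rho = float(np.corrcoef(loss_high, loss_low)[0, 1])
    # Idealized residual variance fraction under an optimal scalar coefficient.
    return rho, 1.0 - rho**2

rng = np.random.default_rng(1)
shared = rng.standard_normal(5000)                   # noise common to both fidelities
loss_high = shared + 0.5 * rng.standard_normal(5000)

rows = []
for gap in (0.2, 1.0, 3.0):                          # larger gap = weaker coupling (assumed)
    loss_low = shared + gap * rng.standard_normal(5000)
    rho, ratio = pearson_and_variance_ratio(loss_high, loss_low)
    rows.append((gap, rho, ratio))
    print(f"gap={gap}: rho={rho:.3f}, idealized variance ratio={ratio:.3f}")
```

In this synthetic setup the estimated correlation falls as the gap grows, so the idealized variance ratio moves toward 1 and the control variate offers less benefit.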

Theorems & Definitions (6)

  • Theorem 1
  • Lemma 1: Adapted from zhang2020global
  • proof
  • Lemma 2
  • proof
  • proof