A KL-regularization framework for learning to plan with adaptive priors

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

TL;DR

This paper tackles exploration inefficiency in model-based RL for high-dimensional continuous control by unifying MPPI-based methods under a KL-regularized policy optimization framework. It introduces Policy Optimization–Model Predictive Control (PO-MPC), which regularizes the sampling policy toward a planner-derived prior with strength $λ$ and incorporates an adaptive intermediate prior to reduce variance from replay-planner samples. The authors show that prior MPPI-based approaches emerge as special cases of PO-MPC and demonstrate significant gains in sample efficiency and final performance on challenging benchmarks, especially in high-dimensional settings. The work provides a practical, flexible design space for planning-enhanced RL and offers guidance on selecting priors and KL regularization to balance speed of convergence with exploration.

Abstract

Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization–Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
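The trade-off between return maximization and KL divergence minimization described above can be illustrated with a small sketch. This is not the paper's implementation: it assumes diagonal Gaussian action distributions for both the sampling policy and the planner-derived prior, a scalar value estimate standing in for the return term, and the direction KL(policy ‖ prior). The function names `gaussian_kl` and `kl_regularized_objective` and the parameter `lam` (playing the role of the regularization strength $λ$) are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl(
    mu_q: Sequence[float],
    std_q: Sequence[float],
    mu_p: Sequence[float],
    std_p: Sequence[float],
) -> float:
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2.0 * sp**2) - 0.5
    return total


def kl_regularized_objective(
    value_estimate: float,
    mu_pi: Sequence[float],
    std_pi: Sequence[float],
    mu_prior: Sequence[float],
    std_prior: Sequence[float],
    lam: float,
) -> float:
    """Value term minus lam * KL(policy || prior); to be maximized.

    Assumption: the prior is a Gaussian fitted to the planner's action
    distribution. The KL direction is an illustrative choice.
    """
    kl = gaussian_kl(mu_pi, std_pi, mu_prior, std_prior)
    return value_estimate - lam * kl


# A policy whose mean is one unit away from the prior mean, unit variances.
objective = kl_regularized_objective(
    value_estimate=1.0,
    mu_pi=[1.0],
    std_pi=[1.0],
    mu_prior=[0.0],
    std_prior=[1.0],
    lam=2.0,
)
```

In this sketch, `lam = 0` recovers pure value maximization, while larger values of `lam` penalize deviation from the planner-derived prior more heavily.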

Paper Structure

This paper contains 23 sections, 18 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: Performance comparison on 14 state-based high-dimensional control tasks from HumanoidBench. Mean of 3 runs; shaded areas are 95% confidence intervals. The top-left panel shows results averaged across all tasks except Reach, which has a different return range.
  • Figure 2: Effects of using a learned intermediate prior, $\pi_{\theta_p}$, instead of the planning samples, $\pi_P$, from the replay buffer. Mean of 3 runs; shaded areas are 95% confidence intervals. We report the average across tasks (top) and results on the Balance Simple task (bottom). See Appendix \ref{app:additional_results} for results on all tasks.
  • Figure 3: Effects of approximating the planning policy with the intermediate prior through different cost functions. Mean of 3 runs; shaded areas are 95% confidence intervals. We report the average across tasks, along with environments that show a clear effect of training with the loss in Eq. \ref{eq:fkl_loss} instead of the loss in Eq. \ref{eq:rkl_loss}. See Appendix \ref{app:additional_results} for results on all tasks.
  • Figure 4: Performance comparison of PO-MPC and the baselines on 7 state-based high-dimensional control tasks from the DMControl Suite. Mean of 3 runs; shaded areas are 95% confidence intervals. The top-left panel shows results averaged across all 7 tasks.
  • Figure 5: Performance comparison on 14 state-based high-dimensional control tasks from HumanoidBench. Mean of 3 runs; shaded areas are 95% confidence intervals. The top-left panel shows results averaged across all tasks except Reach, which has a different return range. Using the intermediate policy does not harm performance and improves it in some tasks.
  • ...and 2 more figures