Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

Toshiaki Hori, Jonathan DeCastro, Deepak Gopinath, Avinash Balachandran, Guy Rosman

TL;DR

The paper tackles the challenge of sample-efficient planning in safety- or cost-constrained domains by fusing high-level reinforcement learning with a sample-based MPPI controller. It introduces a bi-directional RL–MPC architecture where MPPI rollouts serve as structured virtual data to accelerate value learning and policy improvement, while the RL policy steers MPPI via high-level objective shaping. A two-buffer data system combined with an adaptive influence ratio controls the mix of real and virtual data, and a formal bound quantifies value-function error under model mismatch and resampling biases. Empirical results across Acrobot, Lunar Lander, and CARLA Racing show improved data efficiency and task success, with adaptive ρ delivering the strongest gains in misspecified domains and notably faster convergence in racing scenarios.
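The two-buffer mixing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_mixed_batch`, the buffer representation (plain lists of transitions), and the convention that `rho` is the probability of drawing from the virtual (MPPI) buffer are all assumptions made for illustration.

```python
import random


def sample_mixed_batch(real_buffer, virtual_buffer, batch_size, rho, rng=random):
    """Draw a training batch from two buffers.

    Assumption: each item is taken from the virtual (MPPI) buffer with
    probability rho and from the real-environment buffer otherwise.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if not real_buffer and not virtual_buffer:
        raise ValueError("both buffers are empty")

    batch = []
    for _ in range(batch_size):
        # Fall back to whichever buffer is non-empty if the other has no data.
        use_virtual = bool(virtual_buffer) and (
            not real_buffer or rng.random() < rho
        )
        source = virtual_buffer if use_virtual else real_buffer
        batch.append(rng.choice(source))
    return batch
```

With `rho = 0` the learner sees only real transitions and with `rho = 1` only MPPI rollouts; intermediate values give the convex mix, in expectation, of the two data sources.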

Abstract

We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (2.1×) compared to non-adaptive sampling.
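For readers unfamiliar with MPPI, the following is a minimal sketch of one generic MPPI update: perturb a nominal control sequence, score each candidate with a rollout cost, and average the candidates with exponentiated-cost weights. It is a textbook-style illustration, not the paper's controller. The function `mppi_step`, its parameters, and the idea that a high-level RL action enters through the cost (here a hypothetical `target` value) are assumptions; the paper's selection of a single candidate may differ from the weighted average used here.

```python
import math
import random


def mppi_step(nominal, rollout_cost, num_samples=64, noise_std=0.5,
              temperature=1.0, rng=random):
    """One generic MPPI update of a control sequence (one scalar per step)."""
    # Sample candidates by adding Gaussian noise to the nominal sequence.
    candidates = [[u + rng.gauss(0.0, noise_std) for u in nominal]
                  for _ in range(num_samples)]
    costs = [rollout_cost(c) for c in candidates]

    # Exponentiated-cost weights; subtracting the minimum keeps exp() stable.
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]

    # The updated plan is the weight-averaged candidate sequence.
    plan = [sum(w * c[t] for w, c in zip(weights, candidates))
            for t in range(len(nominal))]
    # Candidates and costs are returned so they could be kept as virtual data.
    return plan, candidates, costs


# Hypothetical usage: a high-level action sets the target the cost tracks.
target = 1.0  # stand-in for an RL-chosen objective parameter


def tracking_cost(seq):
    return sum((u - target) ** 2 for u in seq)


plan, candidates, costs = mppi_step([0.0, 0.0, 0.0], tracking_cost,
                                    num_samples=256, temperature=0.1,
                                    rng=random.Random(0))
```

Returning the full candidate set alongside the plan mirrors the idea that unexecuted rollouts need not be discarded and can instead feed value learning.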

Paper Structure

This paper contains 48 sections, 2 theorems, 35 equations, 8 figures, 6 tables, 2 algorithms.

Key Result

theorem 1

Let $\pi$ be the policy of the combined RL-MPC approach and $\pi^{\star}$ be the optimal policy under the true MDP, and take the bounds from the boundedness assumption. Let $\hat{V}$ be a value function estimate, and let … We can bound the value function error of the proposed approach relative to the value obtained from the optimal policy in the true domain according to: … where $R_{\max} = \sup_{s, a}$ …

Figures (8)

  • Figure 1: Diagram of the combined approach. The RL policy generates actions that are fed to MPPI, which then generates a set of candidates $m_0, m_1, \ldots$. One candidate, $m^{\star}$, is selected and applied to the real environment. The remainder are stored in a buffer $\mathcal{D}_{\operatorname{MPPI}}$. Data is sampled from the two buffers according to a convex combination governed by the parameter $\rho_t$, which is set by the uncertainty estimated by a critic ensemble. The data is passed to RL for value and policy iteration.
  • Figure 2: Experimental environments.
  • Figure 3: Three figures on the left: episode reward for each environment, averaged over 5 seeds. Rightmost figure: episode reward for PPO-MPPI ($\rho=0.3$ vs. $\rho_0=0.3, \lambda=0.98$) in the Racing environment.
  • Figure 4: Episode reward under the quadratic (QP) cost formulation, averaged over 5 seeds.
  • Figure 5: Episode reward when using the RL value function $V$ as the terminal cost, averaged over 5 seeds.
  • ...and 3 more figures
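
The Figure 1 caption states that $\rho_t$ is defined by the uncertainty estimated by a critic ensemble, but the exact mapping is not given on this page. The sketch below shows one hypothetical way to turn ensemble disagreement into a ratio in $[0, 1]$; the function `influence_ratio`, the saturating form, and the `scale` parameter are assumptions, not the paper's definition.

```python
import statistics


def influence_ratio(ensemble_values, rho_min=0.0, rho_max=1.0, scale=1.0):
    """Map critic-ensemble disagreement to a mixing ratio.

    Assumption: higher disagreement among the critics' value estimates for a
    state means more weight on MPPI (virtual) data.
    """
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    # Population standard deviation of the critics' value predictions.
    spread = statistics.pstdev(ensemble_values)
    # Saturating map: zero spread gives rho_min, large spread approaches rho_max.
    squashed = spread / (spread + scale)
    return rho_min + (rho_max - rho_min) * squashed
```

When all critics agree the ratio sits at `rho_min`, and it grows monotonically with disagreement, which matches the abstract's description of leaning on further MPPI exploration where value estimates are uncertain.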

Theorems & Definitions (4)

  • theorem 1
  • lemma 1: $H$-Step Simulation Lemma
  • proof
  • proof