Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
Toshiaki Hori, Jonathan DeCastro, Deepak Gopinath, Avinash Balachandran, Guy Rosman
TL;DR
The paper tackles sample-efficient planning in safety- or cost-constrained domains by fusing high-level reinforcement learning (RL) with a sampling-based Model Predictive Path Integral (MPPI) controller. It introduces a bi-directional RL–MPC architecture: MPPI rollouts serve as structured virtual data that accelerate value learning and policy improvement, while the RL policy steers MPPI through high-level objective shaping. A two-buffer data system with an adaptive influence ratio ρ controls the mix of real and virtual data, and a formal bound quantifies the value-function error under model mismatch and resampling bias. Experiments on Acrobot, Lunar Lander, and CARLA Racing show improved data efficiency and task success; the adaptive ρ delivers its strongest gains in misspecified domains and notably faster convergence in the racing scenarios.
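As a rough illustration of the two-buffer idea, the sketch below draws a training batch in which a fraction ρ comes from a virtual buffer (standing in for MPPI rollouts) and the rest from a real-experience buffer. The function name `sample_mixed_batch`, the list-based buffers, and the treatment of ρ as a fixed number in [0, 1] are assumptions made for illustration; the paper's rule for adapting ρ is not reproduced here.

```python
import numpy as np


def sample_mixed_batch(real_buffer, virtual_buffer, rho, batch_size, rng):
    """Draw a batch with roughly a fraction `rho` of virtual transitions.

    Hypothetical helper: `rho` plays the role of the influence ratio and is
    taken as given. Buffers are plain Python lists of transitions.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")

    # Number of virtual samples, capped by what the virtual buffer holds.
    n_virtual = min(int(round(rho * batch_size)), len(virtual_buffer))
    # Fill the remainder with real samples, capped by the real buffer size.
    n_real = min(batch_size - n_virtual, len(real_buffer))

    # Sample without replacement by taking a prefix of a random permutation.
    real_idx = rng.permutation(len(real_buffer))[:n_real]
    virt_idx = rng.permutation(len(virtual_buffer))[:n_virtual]

    batch = [real_buffer[i] for i in real_idx]
    batch += [virtual_buffer[i] for i in virt_idx]
    return batch, n_real, n_virtual


# Toy usage: transitions are tagged tuples so the mix is easy to inspect.
rng = np.random.default_rng(0)
real = [("real", i) for i in range(20)]
virtual = [("virtual", i) for i in range(20)]
batch, n_real, n_virtual = sample_mixed_batch(real, virtual, 0.25, 8, rng)
```

With ρ = 0.25 and a batch size of 8, the batch holds 2 virtual and 6 real transitions. An adaptive scheme would recompute ρ before each call, for example from an estimate of value uncertainty.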
Abstract
We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and model predictive control (MPC) planning. Our formulation tightly and elegantly couples the two planning paradigms: it uses reinforcement learning actions to inform the Model Predictive Path Integral (MPPI) sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process uses further MPPI exploration where value estimates are uncertain, improving training robustness and the quality of the resulting policies. The result is a robust planning approach that handles complex planning problems and adapts easily to different applications, as demonstrated in several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as 2.1x faster convergence compared to non-adaptive sampling.
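For readers unfamiliar with MPPI, the following is a minimal sketch of a single generic MPPI update, not the authors' implementation. It perturbs a nominal control sequence with Gaussian noise, rolls each perturbed sequence through a dynamics model, and averages the sequences with softmin weights on their costs. The function `mppi_step`, the toy single-integrator dynamics, and the use of a scalar `goal` as a stand-in for a high-level RL action that shapes the cost are all assumptions for illustration.

```python
import numpy as np


def mppi_step(state, nominal, dynamics, cost, rng,
              n_samples=256, sigma=0.5, temperature=1.0):
    """One generic MPPI update around `nominal`, shape (horizon, action_dim)."""
    horizon, action_dim = nominal.shape
    # Perturb the nominal control sequence with Gaussian noise.
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, action_dim))
    controls = nominal[None, :, :] + noise

    # Roll out every perturbed sequence and accumulate its cost.
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = np.array(state, dtype=float)
        for t in range(horizon):
            x = dynamics(x, controls[k, t])
            costs[k] += cost(x, controls[k, t])

    # Softmin weights; subtracting the minimum cost keeps exp() stable.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()

    # The updated plan is the weighted average of the sampled sequences.
    plan = np.einsum("k,kha->ha", weights, controls)
    return plan, costs, weights


# Toy usage. `goal` is a hypothetical stand-in for a high-level RL action
# that shapes the MPPI objective.
goal = 1.0
dynamics = lambda x, u: x + 0.1 * u                  # single integrator
cost = lambda x, u: float(np.sum((x - goal) ** 2))   # distance to goal
rng = np.random.default_rng(0)
plan, costs, weights = mppi_step(
    np.zeros(1), np.zeros((10, 1)), dynamics, cost, rng
)
```

Because the weights decrease with cost, the weighted average cost of the samples is never higher than their plain average. In a coupled RL-MPC setting, the sampled rollouts and their costs could also be stored as virtual data for value learning.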
