Bootstrapped Model Predictive Control

Yuhang Wang, Hanwei Guo, Sizhe Wang, Long Qian, Xuguang Lan

TL;DR

BMPC addresses the data-inefficiency of policy learning in plan-based model-based RL by learning a neural policy through imitation of an MPC expert and using that policy to guide MPC planning, while performing on-policy TD-learning for value estimation and employing lazy reanalysis for efficiency. The method combines a TD-MPC2–style world model with MPC planning and a KL-based imitation objective, enabling stronger policy/value learning and reduced planning cost. Empirically, BMPC delivers superior data efficiency and stability, especially on high-dimensional locomotion tasks, and achieves comparable or better asymptotic performance with smaller networks. The work demonstrates the practical potential of leveraging MPC strengths to bootstrap policy learning and planning in continuous control, and provides a reproducible open-source implementation.

Abstract

Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/wertyuilife2/bmpc.
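The imitation step described above can be illustrated with a small sketch. It assumes, purely for illustration, that both the MPC expert and the network policy output diagonal Gaussian action distributions, and that the loss is KL(expert || policy) averaged over a batch. The abstract does not specify the KL direction or the parameterization, and the names `diag_gaussian_kl` and `imitation_loss` are hypothetical, not taken from the BMPC code base.

```python
import numpy as np

def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over the action dimension."""
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = (
        np.log(std_q / std_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )
    return per_dim.sum(axis=-1)

def imitation_loss(expert_mu, expert_std, policy_mu, policy_std):
    """Batch-averaged KL(expert || policy).

    Assumption: the expert statistics come from the MPC planner and are
    treated as fixed targets; only the policy parameters would be trained.
    """
    return diag_gaussian_kl(expert_mu, expert_std, policy_mu, policy_std).mean()

# Toy batch: 3 states, 2 action dimensions.
expert_mu = np.zeros((3, 2))
expert_std = np.ones((3, 2))
policy_mu = np.ones((3, 2))
policy_std = np.ones((3, 2))
loss = imitation_loss(expert_mu, expert_std, policy_mu, policy_std)
```

In a training loop, a loss of this form would be minimized with respect to the policy network's outputs, pulling the policy toward the planner's action distribution. The reverse KL direction is an equally plausible choice under the same description.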

Paper Structure

This paper contains 14 sections, 7 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overview. (left) BMPC learns a network policy through expert imitation and a lazy reanalyze mechanism, plans during inference using guided MPC, and performs model-based value learning in an on-policy manner. (right) Averaged evaluation performance of the network policy compared to the MPC policy in BMPC and TD-MPC2 on DMControl tasks. BMPC achieves better policy learning, which further boosts the performance of MPC. Mean and 95% CIs over 5 seeds.
  • Figure 2: Performance gap between TD-MPC2 policies. (left) Evaluation performance of the network policy compared to the MPC policy in TD-MPC2. The network policy struggles with complex tasks like Dog Run and Humanoid Run. Mean and 95% CIs over 5 seeds. (right) Distributions of action value differences during MPPI over environment steps.
  • Figure 3: DMControl tasks. Comparing BMPC to baselines on DMControl tasks. In the top left, we present the average performance of 7 high-dimensional locomotion tasks and all 28 tasks. Mean and 95% CIs over 5 seeds. Training curves for all tasks are provided in Appendix C.
  • Figure 4: High-dimensional locomotion tasks. Comparison of BMPC with baselines on the 7 most challenging high-dimensional locomotion tasks. The environment steps are extended to 4M for a more comprehensive comparison. In the top left, we present the results averaged over all 7 tasks. Mean and 95% CIs over 5 seeds.
  • Figure 5: HumanoidBench locomotion suite. Comparing BMPC to baselines on the HumanoidBench locomotion suite. In the top left, we present the average performance of all 13 tasks except for Reach, which is excluded because of its different reward scale. Mean and 95% CIs over 5 seeds.
  • ...and 8 more figures