
Soft MPCritic: Amortized Model Predictive Value Iteration

Thomas Banker, Nathan P. Lawrence, Ali Mesbah

Abstract

Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamics models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.
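To make the planner-critic coupling concrete, here is a minimal sketch of a single MPPI update with a learned terminal value bootstrap, in the spirit of the abstract. All names and hyperparameters (`f`, `ell`, `V_phi`, `K`, `sigma`, `lam`) are illustrative assumptions, not the authors' implementation; `V_phi` is treated as a learned cost-to-go derived from the terminal Q-function.

```python
# Hypothetical sketch: one MPPI update with a learned terminal cost-to-go.
import numpy as np

def mppi_plan(s0, nominal, f, ell, V_phi, K=256, sigma=0.3, lam=1.0):
    """Refine the open-loop action sequence `nominal` (H, act_dim) from state s0.

    f:     batched dynamics model, (K, obs) x (K, act) -> (K, obs)   [assumed]
    ell:   batched stage cost,     (K, obs) x (K, act) -> (K,)       [assumed]
    V_phi: learned terminal cost-to-go, (K, obs) -> (K,)             [assumed]
    """
    H, act_dim = nominal.shape
    noise = sigma * np.random.randn(K, H, act_dim)
    actions = nominal[None] + noise                    # K perturbed sequences
    costs = np.zeros(K)
    s = np.repeat(s0[None], K, axis=0)                 # batched rollout state
    for t in range(H):
        costs += ell(s, actions[:, t])                 # accumulate stage cost
        s = f(s, actions[:, t])                        # predict next state
    costs += V_phi(s)                                  # bootstrap beyond horizon H
    w = np.exp(-(costs - costs.min()) / lam)           # soft-minimum weights
    w /= w.sum()
    return nominal + np.einsum('k,khd->hd', w, noise)  # weighted noise average
```

The exponential weighting is where the "soft" enters: the planner effectively evaluates a temperature-`lam` soft minimum (a log-sum-exp) over sampled trajectory costs rather than a hard argmin, matching the soft value space the abstract refers to, while the `V_phi` bootstrap is what implicitly extends the effective planning horizon beyond `H` steps.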

Figures (4)

  • Figure C1: soft MPCritic effectively combines key approximations of RL and MPPI for autonomous self-improvement. During online control, the value-space planner updates the initial open-loop action sequence $\upsilon$ to align with the soft minimum of the value function $\mathcal{Q}_\phi$. The updated sequence $\upsilon^+$, along with the transition $(s,a,\ell,s')$, is stored in the replay buffer to warm-start the planner's value targets $\mathcal{V}^{\text{MPPI}}_\phi$. After a value iteration step, the planner is updated with the new $\mathcal{Q}_\phi$, and the further-refined solution $\upsilon^+$ is stored for later reuse (see the code sketch following this list).
  • Figure D1: Performance of warm- and cold-started MPPI target computations when using a single dynamics model or an ensemble of dynamics models $f$. Lines show the median over $10$ seeds; shading shows percentiles. The horizontal dashed line corresponds to SAC.
  • Figure D2: Ablation of soft MPCritic with a Gaussian prior. Left: using only a dynamics model ensemble ($f$ Ensemble), only the terminal $\mathcal{Q}$-function ($\mathcal{Q}$), or both. Right: using MPPI for control only, for $\mathcal{Q}$-function targets only, or for both. Lines show the median over $10$ seeds; shading shows percentiles.
  • Figure D3: Cumulative reward for soft MPCritic with a uniform prior and RL baselines (SAC and DDPG). Lines show the median over $10$ seeds; shading shows percentiles. The horizontal dashed lines correspond to SAC and DDPG after $10^6$ time steps.
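The amortized warm-start loop described in the Figure C1 caption might be sketched as follows, reusing the hypothetical `mppi_plan` above in a batched form. The buffer layout, the shift-and-pad recycling of stored plans, the discount `gamma`, and all helper names (`mppi_update`, `v_mppi`) are assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch: replay buffer with stored plans, used to warm-start
# batched MPPI value-target computation for fitted value iteration.
import numpy as np

class SequenceReplay:
    """Replay buffer storing transitions alongside planned open-loop sequences."""
    def __init__(self, obs_dim, act_dim, H, size=100_000):
        self.s = np.zeros((size, obs_dim)); self.a = np.zeros((size, act_dim))
        self.l = np.zeros(size); self.s2 = np.zeros((size, obs_dim))
        self.ups = np.zeros((size, H, act_dim))   # stored plans (upsilon^+)
        self.ptr, self.n, self.cap = 0, 0, size

    def add(self, s, a, l, s2, ups_plus):
        i = self.ptr
        self.s[i], self.a[i], self.l[i], self.s2[i], self.ups[i] = s, a, l, s2, ups_plus
        self.ptr = (i + 1) % self.cap
        self.n = min(self.n + 1, self.cap)

def warm_started_targets(buf, mppi_update, v_mppi, gamma=0.99, batch=64):
    """One batch of fitted value-iteration targets with warm-started MPPI.

    mppi_update: batched variant of mppi_plan, (B, obs) x (B, H, act) -> (B, H, act)
    v_mppi:      MPPI value estimate V^MPPI_phi, (B, obs) x (B, H, act) -> (B,)
    """
    idx = np.random.randint(buf.n, size=batch)
    s, a, l, s2, ups = buf.s[idx], buf.a[idx], buf.l[idx], buf.s2[idx], buf.ups[idx]
    # Recycle each stored plan: shift one step and repeat the last action, so
    # planning at s' starts from a near-optimal sequence, not a cold prior.
    warm = np.concatenate([ups[:, 1:], ups[:, -1:]], axis=1)
    ups_plus = mppi_update(s2, warm)              # refine plans at s'
    y = l + gamma * v_mppi(s2, ups_plus)          # y = ell + gamma * V^MPPI(s')
    buf.ups[idx] = ups_plus                       # write refined plans back
    return s, a, y                                # regression data for Q_phi
```

Because each sampled transition carries a plan that was already refined online, and is re-refined here before being written back, the batched target computation can use far fewer samples or iterations than a cold start from the prior; this is the amortization the abstract credits with making soft MPCritic computationally practical while preserving solution quality.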