Diffusion Model Predictive Control

Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, Kevin Murphy

TL;DR

D-MPC advances model-based planning by learning multi-step diffusion-based dynamics and action proposals from offline data, enabling robust online MPC with horizon-based planning. By combining trajectory-level diffusion models with a simple sampling-based planner and a Transformer-based value estimator, it mitigates compounding errors and supports runtime adaptation to novel rewards and dynamics. Empirical results on D4RL show strong performance against MBOP and competitive standing with SOTA methods, with clear ablations confirming the value of multi-step diffusion, task adaptation, and diffusion-based proposals. The approach also demonstrates potential for fast policy distillation to enable high-frequency control, while acknowledging runtime and data distribution limitations inherent to offline RL.

Abstract

We propose Diffusion Model Predictive Control (D-MPC), a novel MPC approach that learns a multi-step action proposal and a multi-step dynamics model, both using diffusion models, and combines them for use in online MPC. On the popular D4RL benchmark, we show performance that is significantly better than existing model-based offline planning methods using MPC (e.g. MBOP) and competitive with state-of-the-art (SOTA) model-based and model-free reinforcement learning methods. We additionally illustrate D-MPC's ability to optimize novel reward functions at run time and adapt to novel dynamics, and highlight its advantages compared to existing diffusion-based planning baselines.
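The planning loop described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the authors' implementation: `propose_actions` and `predict_states` are hypothetical stand-ins for the learned diffusion action proposal and multi-step dynamics model (here replaced by Gaussian noise and a hand-written 1-D system), and the planner is assumed to simply keep the highest-scoring sample. The objective is passed in as a plain function, which is what makes it possible to swap in a novel reward at run time without retraining the models.

```python
import random
from typing import Callable, List

Objective = Callable[[List[float], List[float]], float]


def propose_actions(state: float, horizon: int, rng: random.Random) -> List[float]:
    """Hypothetical stand-in for the learned multi-step action proposal."""
    return [rng.gauss(0.0, 1.0) for _ in range(horizon)]


def predict_states(state: float, actions: List[float]) -> List[float]:
    """Hypothetical stand-in for the multi-step dynamics model.

    Returns the whole predicted state sequence for an action sequence.
    The toy system is 1-D: each action nudges the state by 0.1 * action.
    """
    states = []
    current = state
    for action in actions:
        current = current + 0.1 * action
        states.append(current)
    return states


def plan(state: float, objective: Objective, num_samples: int,
         horizon: int, rng: random.Random) -> List[float]:
    """Sample action sequences, score the predicted rollouts, keep the best."""
    best_score = float("-inf")
    best_actions: List[float] = []
    for _ in range(num_samples):
        actions = propose_actions(state, horizon, rng)
        states = predict_states(state, actions)
        score = objective(states, actions)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions


def target_objective(target: float) -> Objective:
    """A novel objective chosen at run time: stay close to a target value."""
    def objective(states: List[float], actions: List[float]) -> float:
        return -sum((s - target) ** 2 for s in states)
    return objective


def run_episode(initial_state: float, objective: Objective, steps: int,
                rng: random.Random, num_samples: int = 64,
                horizon: int = 8) -> float:
    """Receding-horizon control: replan every step, execute the first action."""
    state = initial_state
    for _ in range(steps):
        actions = plan(state, objective, num_samples, horizon, rng)
        # The toy environment reuses the same dynamics as the model.
        state = predict_states(state, actions[:1])[0]
    return state
```

Calling `run_episode(0.0, target_objective(1.0), steps=50, rng=random.Random(0))` drives the toy state toward the target; passing a different `target_objective` changes the behavior without touching the proposal or dynamics stand-ins.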

Paper Structure

This paper contains 34 sections, 5 equations, 2 figures, 7 tables, 5 algorithms.

Figures (2)

  • Figure 1: Novel reward functions can generate interesting agent behaviors. The leftmost column shows an example episode generated by D-MPC trained on the Walker2d medium-replay dataset, using the trained value function in the planner. The remaining three columns present individual examples of behaviors generated using a height-based novel objective in the planner, with each column corresponding to a different target height. The top row of each column displays the agent’s height at each timestep within the episode. The middle row shows two snapshots of the agent per episode, while the bottom row graphs the novel reward (targeted by the planner) and the actual environment-provided reward received by the agent at each timestep. This figure serves as a qualitative demonstration of how novel rewards can be employed to produce interesting behaviors.
  • Figure 2: Accuracy of long-horizon dynamics prediction. We train the dynamics models on the medium dataset and evaluate on the medium (training data), medium-replay (lower-quality data), and expert (higher-quality data) datasets. Prediction errors are measured by the median root mean square deviation (RMSD) on non-velocity coordinates, based on 1024 sampled state-action sequences of length 256. Plots show median $\pm$ 10 percentile bands. The multi-step diffusion dynamics model incurs significantly lower prediction error on training data while maintaining superior generalization abilities, outperforming the single-step and auto-regressive alternatives. The auto-regressive transformer (ART) dynamics model outperforms the single-step diffusion dynamics model. The single-step MLP dynamics model exhibits compounding errors that grow rapidly for long-horizon dynamics predictions.
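
The error metric in Figure 2 can be made concrete with a small sketch. The details below are assumptions, since the caption does not spell them out: RMSD is taken per sampled sequence at a given timestep over the selected (non-velocity) coordinates, and the "$\pm$ 10 percentile" band is read as the 40th and 60th percentiles across samples. The function names and the `coord_idx` argument are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Trajectory = Sequence[Sequence[float]]  # indexed as [timestep][coordinate]


def rmsd(pred: Sequence[float], true: Sequence[float],
         coord_idx: Sequence[int]) -> float:
    """Root mean square deviation over the selected coordinates only."""
    squared = [(pred[i] - true[i]) ** 2 for i in coord_idx]
    return math.sqrt(sum(squared) / len(squared))


def summarize_errors(pred_trajs: List[Trajectory], true_trajs: List[Trajectory],
                     coord_idx: Sequence[int],
                     timestep: int) -> Tuple[float, float, float]:
    """Return (median, 40th percentile, 60th percentile) of RMSD across samples.

    Assumes the band in the figure is median +/- 10 percentile points.
    """
    errors = [
        rmsd(pred[timestep], true[timestep], coord_idx)
        for pred, true in zip(pred_trajs, true_trajs)
    ]
    # Nine cut points at the 10th, 20th, ..., 90th percentiles.
    deciles = statistics.quantiles(errors, n=10, method="inclusive")
    return statistics.median(errors), deciles[3], deciles[5]
```

Evaluating `summarize_errors` at each timestep of the 256-step sequences would give one curve with a band per dynamics model, which is how a plot like Figure 2 could be assembled.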