Bisimulation metric for Model Predictive Control

Yutaka Shimizu, Masayoshi Tomizuka

TL;DR

BS-MPC addresses stability, noise robustness, and computational efficiency gaps in model-based RL by training the encoder with a $\pi^*$-bisimulation loss and integrating this with a model-predictive control framework. It maintains the TD-MPC architecture but adds explicit encoder supervision and a parallelizable computation flow, enabling faster training and stronger guarantees on latent-space fidelity. Theoretical analysis bounds the cumulative reward difference between the original state space and the learned latent space, while empirical results on the DM Control Suite show improved performance and robustness, including under input distractions. The approach offers a practical, scalable pathway to robust model-based planning in high-dimensional, noisy environments.

Abstract

Model-based reinforcement learning has shown promise for improving sample efficiency and decision-making in complex environments. However, existing methods face challenges in training stability, robustness to noise, and computational efficiency. In this paper, we propose Bisimulation Metric for Model Predictive Control (BS-MPC), a novel approach that incorporates bisimulation metric loss in its objective function to directly optimize the encoder. This time-step-wise direct optimization enables the learned encoder to extract intrinsic information from the original state space while discarding irrelevant details and preventing the gradients and errors from diverging. BS-MPC improves training stability, robustness against input noise, and computational efficiency by reducing training time. We evaluate BS-MPC on both continuous control and image-based tasks from the DeepMind Control Suite, demonstrating superior performance and robustness compared to state-of-the-art baseline methods.
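To make the idea of a bisimulation metric loss on the encoder concrete, here is a minimal sketch in the style of the loss from the cited work zhang2021learning. It is not the authors' implementation. The function name `bisimulation_loss`, the L1 latent distance, the deterministic next-latent prediction `z_next`, and the default `gamma` are all assumptions made for illustration. With deterministic latent dynamics, the Wasserstein term reduces to the Euclidean distance between the two predicted next latents, which is what the sketch uses.

```python
import math
import random


def bisimulation_loss(z, r, z_next, gamma=0.99, perm=None):
    """Illustrative bisimulation-style encoder loss (hypothetical helper).

    z:      list of latent vectors produced by the encoder, one per sample
    r:      list of scalar rewards, one per sample
    z_next: list of predicted next latent vectors (assumed deterministic)
    perm:   optional pairing of samples; a random permutation if omitted
    """
    n = len(z)
    if perm is None:
        perm = random.sample(range(n), n)  # pair each sample with another one

    total = 0.0
    for i, j in enumerate(perm):
        # Distance between the two latent states (L1 norm, an assumption).
        z_dist = sum(abs(a - b) for a, b in zip(z[i], z[j]))
        # Reward difference between the paired samples.
        r_dist = abs(r[i] - r[j])
        # Distance between the predicted next latents; stands in for the
        # Wasserstein term when the dynamics are deterministic.
        w_dist = math.dist(z_next[i], z_next[j])
        # In a gradient-based implementation the target would typically be
        # treated as a constant (stop-gradient).
        target = r_dist + gamma * w_dist
        total += (z_dist - target) ** 2
    return total / n


# Usage: two samples, paired with each other.
z = [[0.0, 0.0], [1.0, 1.0]]
r = [0.0, 1.0]
z_next = [[0.0, 0.0], [0.0, 0.0]]
loss = bisimulation_loss(z, r, z_next, gamma=0.5, perm=[1, 0])
print(loss)  # 1.0
```

The loss is zero when latent distances match the reward difference plus the discounted next-state distance, so states that behave alike are pulled together in latent space regardless of irrelevant details in the observation.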

Paper Structure

This paper contains 33 sections, 5 theorems, 19 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

(Theorem 1 in zhang2021learning) Assume that a policy $\pi$ in BS-MPC continuously improves over time, converging to the optimal policy $\pi^*$. Under this assumption, the following bisimulation metric has a least fixed point $\tilde{d}$, which is a $\pi^*$-bisimulation metric. where $W_p(d)(\mathcal{P}_i, \mathcal{P}_j) = \left( \inf_{\gamma' \in \Gamma(\mathcal{P}_i, \mathcal{P}_j)} \int_{\m

Figures (11)

  • Figure 1: Three open problems of TD-MPC. (Left) TD-MPC initially performs well but collapses after 4 million steps, while BS-MPC steadily improves. (Middle) With added distraction in the input image, TD-MPC fails to gain rewards, whereas BS-MPC remains robust. (Right) BS-MPC reduces training time by removing sequential computation from the objective function.
  • Figure 2: Calculation flow comparison. The black lines show the forward calculation flow, and the red arrows represent the gradient with respect to $\theta^h$. While TD-MPC requires sequential calculation in its forward computational flow, BS-MPC can process all calculations in parallel. Moreover, BS-MPC has an explicit encoder loss in its cost function, so its derivative directly updates the parameters of the encoder. Note that TD-MPC only encodes the original observation at the initial time step and predicts subsequent latent states using the latent dynamics model.
  • Figure 5: Performance comparison averaged over 26 state-based tasks and 9 DM Control tasks with state input. At each evaluation step, the episode return is computed over 10 episodes. The results are averaged over 3 seeds, with shaded regions representing the standard deviation. Results for SAC and Dreamer-v3 are obtained from tdmpc2, and results for TD-MPC are reproduced using their official code with the same architecture and hyperparameters as BS-MPC. We use the same seeds for evaluation.
  • Figure 6: Performance comparison on 10 DM Control image-based tasks. At each evaluation step, the episode return is computed over 10 episodes. The results are averaged over 3 seeds, with shaded regions representing the standard deviation. Results for DrQ-v2 are obtained from their official results, and results for CURL, SAC and Dreamer-v3 are obtained from the Dreamer-v3 code dreamer-v3.
  • Figure 7: Performance comparison on 5 DM Control image-based tasks with distracting information from the Kinetics dataset. At each evaluation step, the episode return is computed over 10 episodes. The results are averaged over 5 seeds, with shaded regions representing the standard deviation. (Top) Original image. (Middle) Distracted image. (Bottom) Performance results. BS-MPC consistently outperforms TD-MPC when the input is disturbed.
  • ...and 6 more figures
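The difference in calculation flow described in the Figure 2 caption can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not the paper's code: the scalar `encode` and `dynamics` functions and both rollout helpers are hypothetical stand-ins. The sequential version encodes only the first observation and chains the dynamics model, so step t+1 must wait for step t. The per-step version encodes every observation and applies the dynamics once to each, so every step is independent and could be batched.

```python
def encode(o):
    """Toy encoder h (hypothetical): maps an observation to a latent."""
    return 2.0 * o


def dynamics(z, a):
    """Toy latent dynamics d (hypothetical): predicts the next latent."""
    return z + a


def sequential_predictions(observations, actions):
    """Encode only the first observation, then roll the dynamics forward."""
    z = encode(observations[0])
    preds = []
    for a in actions:
        z = dynamics(z, a)  # each step depends on the previous prediction
        preds.append(z)
    return preds


def per_step_predictions(observations, actions):
    """Encode every observation; each one-step prediction is independent."""
    latents = [encode(o) for o in observations[:-1]]
    # No step depends on another, so this loop could run as one batch.
    return [dynamics(z, a) for z, a in zip(latents, actions)]


# Usage: a trajectory that is consistent with the toy dynamics.
observations = [0.0, 0.5, 1.5]
actions = [1.0, 2.0]
print(sequential_predictions(observations, actions))  # [1.0, 3.0]
print(per_step_predictions(observations, actions))    # [1.0, 3.0]
```

In this toy case both flows agree because the observations follow the dynamics exactly. With a learned, imperfect model, the sequential flow compounds prediction error across steps, while the per-step flow restarts from an encoded observation each time.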

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • proof
  • Theorem 3
  • proof