Differentiable Information Enhanced Model-Based Reinforcement Learning

Xiaoyuan Zhang, Xinyan Cai, Bo Liu, Weidong Huang, Song-Chun Zhu, Siyuan Qi, Yaodong Yang

TL;DR

MB-MIX introduces a differentiable information enhanced MBRL framework that jointly leverages trajectory length mixing and Sobolev model training to stabilize gradient-based policy optimization in differentiable environments. The method formalizes $J^{mix}_{\pi}(\theta)=(1-\lambda)\sum_{H=1}^{\infty}\lambda^{H-1} J^{H}_{\pi}(\theta)$ and trains dynamics with the Sobolev loss $J_{M}(\varphi)$ to enforce gradient-consistency, with theory showing $\operatorname{Var}(A^{MIX}) \le \operatorname{Var}(A^{SHAC})$ for $\gamma<1$. Empirically, MB-MIX outperforms state-of-the-art baselines across DiffRL, Bruce humanoid, Brax, and DaXBench, achieving higher rewards and greater stability in both rigid- and deformable-object tasks. This work advances practical deployment of differentiable simulators by improving sample efficiency, gradient reliability, and robustness in complex robotics domains.
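
The mixed objective above weights the horizon-$H$ objective $J^{H}_{\pi}(\theta)$ by $(1-\lambda)\lambda^{H-1}$. A minimal sketch of that weighting follows. It assumes the infinite sum is truncated at a maximum horizon and the weights are renormalized to sum to one. The truncation, the function names `mix_weights` and `mixed_objective`, and the numeric values are illustrative assumptions, not the authors' implementation.

```python
def mix_weights(lam, max_horizon):
    """Geometric weights (1 - lam) * lam**(H - 1) for H = 1..max_horizon.

    Assumption: the infinite sum is truncated at max_horizon and the
    weights are renormalized so they sum to one.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    raw = [(1.0 - lam) * lam ** (h - 1) for h in range(1, max_horizon + 1)]
    total = sum(raw)  # equals 1 - lam**max_horizon
    return [w / total for w in raw]


def mixed_objective(horizon_objectives, lam):
    """Weighted sum of per-horizon objectives J^1, ..., J^H_max."""
    weights = mix_weights(lam, len(horizon_objectives))
    return sum(w * j for w, j in zip(weights, horizon_objectives))


# Hypothetical per-horizon objective values J^1..J^4, for illustration only.
j_values = [1.0, 2.0, 3.0, 4.0]
weights = mix_weights(0.5, len(j_values))
j_mix = mixed_objective(j_values, 0.5)
```

With a small $\lambda$ the short horizons dominate the mixture, while $\lambda$ close to 1 spreads weight toward longer horizons. In a gradient-based setting the same weights would be applied to differentiable per-horizon returns rather than to plain numbers.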

Abstract

Differentiable environments have heralded new possibilities for learning control policies by offering rich differentiable information that facilitates gradient-based methods. Compared with prevailing model-free reinforcement learning approaches, model-based reinforcement learning (MBRL) methods have the potential to harness this differentiable information to recover the underlying physical dynamics. However, this presents two primary challenges: effectively utilizing differentiable information to 1) construct models with more accurate dynamics prediction and 2) enhance the stability of policy training. In this paper, we propose a Differentiable Information Enhanced MBRL method, MB-MIX, to address both challenges. First, we adopt a Sobolev model training approach that penalizes incorrect model gradient outputs, enhancing prediction accuracy and yielding more precise models that faithfully capture system dynamics. Second, we mix the lengths of truncated learning windows to reduce the variance of policy gradient estimation, resulting in improved stability during policy learning. To validate the effectiveness of our approach in differentiable environments, we provide theoretical analysis and empirical results. Notably, our approach outperforms previous model-based and model-free methods on multiple challenging tasks, including motion control of rigid robots such as humanoids and deformable object manipulation.
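
To make the idea of penalizing incorrect model gradient outputs concrete, here is a minimal sketch of a generic Sobolev-style loss: a squared value error plus a weighted squared Jacobian error. The exact form of the paper's loss is not given in this summary, so the linear model, the weight `alpha`, and the names `matvec` and `sobolev_loss` are assumptions chosen for illustration.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def sobolev_loss(w, x, y_true, jac_true, alpha=1.0):
    """Squared value error plus alpha times squared Jacobian error.

    Assumption: the learned model is linear, f(x) = w @ x, so its
    Jacobian with respect to x is simply w. jac_true stands in for the
    gradient information supplied by a differentiable environment.
    """
    y_pred = matvec(w, x)
    value_err = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    jac_err = sum(
        (w_ij - j_ij) ** 2
        for w_row, j_row in zip(w, jac_true)
        for w_ij, j_ij in zip(w_row, j_row)
    )
    return value_err + alpha * jac_err


# Hypothetical linear "true" dynamics and a single sampled input.
a_true = [[1.0, 0.1], [0.0, 1.0]]
x = [1.0, 2.0]
y_true = matvec(a_true, x)

# This candidate matches the true output at x but has the wrong Jacobian.
w_candidate = [[1.2, 0.0], [0.0, 1.0]]
value_only = sobolev_loss(w_candidate, x, y_true, a_true, alpha=0.0)
with_gradients = sobolev_loss(w_candidate, x, y_true, a_true, alpha=1.0)
```

In this toy case the value-only loss cannot distinguish the candidate from the true dynamics at the sampled point, while the Jacobian term does. That is the kind of gradient mismatch a Sobolev-style penalty is meant to expose.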

Paper Structure

This paper contains 15 sections, 1 theorem, 6 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions 1, 2, 3, and 4 hold (please see the appendix). Then, in a differentiable environment, the variance of the MIX policy gradient estimate ($A^{\text{MIX}}$) is less than or equal to the variance of the SHAC policy gradient estimate ($A^{\text{SHAC}}$): $\operatorname{Var}(A^{\text{MIX}}) \le \operatorname{Var}(A^{\text{SHAC}})$. Here, $A^{\text{MIX}}$ and $A^{\text{SHAC}}$ respectively denote the MIX and SHAC policy gradient estimates, the balanc

Figures (7)

  • Figure 1: Algorithm diagram. We propose a differentiable information enhanced model-based reinforcement learning approach, MB-MIX, which uses the Sobolev model training method to learn a dynamics model that leverages gradient information (Diff Info) from the differentiable environment. We perform rollouts in the differentiable environment as well as in the learned model, and employ Trajectory Length Mix to weight and sum the optimization functions. Policy updates are then performed through backpropagation. Notably, in our method, model training with Diff Info and gradient-based policy training are consistent.
  • Figure 2: Task Gallery. We assessed the algorithm's effectiveness in four differentiable environments encompassing various control tasks. (a) DiffRL: The agent controls a range of robots, including those with muscles. (b) Bruce, a humanoid robot: Designed for tasks like Fast-Run, it was introduced into the DiffRL environment to extend its real-world applications. (c) Brax: Involves advanced tasks such as Fetch and Grasp. (d) DaXBench: Entails a series of tasks related to deformable object manipulation.
  • Figure 3: Experiment results in the tabular case. We show the effectiveness of mixing trajectory lengths in a simple, purpose-built tabular environment. The y-axis label "reward/step" denotes the average reward. The right end of the horizontal axis represents 1e6 environment steps.
  • Figure 4: Impact of different trajectory lengths on training. The vertical axis represents reward, while the horizontal axis represents the maximum trajectory length. Our Mix method enhances policy training stability.
  • Figure 5: Experiments on Bruce, a humanoid robot. The top-left figure demonstrates that our proposed MB-MIX method surpasses all model-free and model-based algorithms; the vertical axis represents the rewards. In the top-right figure, we visualize the performance in the “Fast Run” task. Starting from the same position and after the same amount of time, our MB-MIX method enables the trained robot to move further. Judging from the alternation of the red and blue lines, which track the footsteps, our MB-MIX algorithm achieves better alternating leg movements.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1