MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning

Wanpeng Zhang, Xi Xiao, Yao Yao, Mingzhe Chen, Dijun Luo

TL;DR

MBDP tackles the robustness-sample efficiency dilemma in model-based reinforcement learning by introducing two complementary dropout mechanisms: rollout-dropout, which biases learning toward lower-reward events for robustness, and model-dropout, which prunes an ensemble by predictive bias for efficiency. The framework establishes theoretical guarantees linking rollout-dropout to CVaR robustness and provides explicit bounds on performance degradation due to dropout, while enabling a tunable trade-off via the parameters $\alpha$ and $\beta$. Empirically, MBDP shows superior sample efficiency and competitive robustness on MuJoCo tasks compared with leading baselines, with ablations confirming the distinct roles of each dropout component. The results demonstrate a flexible, theoretically grounded approach for balancing robustness and efficiency in practical, continuous-control settings.

Abstract

Model-based reinforcement learning is a widely accepted way to reduce excessive sample demands. However, the predictions of the dynamics models are often not accurate enough, and the resulting bias may lead to catastrophic decisions due to insufficient robustness. It is therefore highly desirable to investigate how to improve the robustness of model-based RL algorithms while maintaining high sample efficiency. In this paper, we propose Model-Based Double-dropout Planning (MBDP) to balance robustness and efficiency. MBDP consists of two kinds of dropout mechanisms: the rollout-dropout aims to improve robustness at a small cost in sample efficiency, while the model-dropout is designed to compensate for the lost efficiency at a slight expense of robustness. By combining them in a complementary way, MBDP provides a flexible control mechanism to meet different demands of robustness and efficiency by tuning the two corresponding dropout ratios. The effectiveness of MBDP is demonstrated both theoretically and experimentally.

Paper Structure

This paper contains 27 sections, 6 theorems, 48 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

It holds given the constraint set of perturbation

Figures (6)

  • Figure 1: Overview of the MBDP algorithm. When interacting with the environment, we collect samples into environment replay buffer $\mathcal{D}_{\mathrm{env}}$, used for training the simulator model of the environment. Then we implement the model-dropout procedure and perform rollouts on the model ensemble. The sampled data from the model ensemble is filled into a temporary batch, and then we get a dropout buffer $\mathcal{D}_\mathrm{model}$ by implementing the rollout-dropout procedure. Finally, we use samples from $\mathcal{D}_\mathrm{model}$ to optimize the policy $\pi(a|s)$.
  • Figure 2: Learning curves of our MBDP algorithm and four baselines on different continuous control environments. Solid curves indicate the mean over trials with 5 different seeds. Shaded regions correspond to the standard deviation among trials. Each trial is evaluated every 1000 steps. The dashed reference lines are the asymptotic performance of the SAC algorithm. These results show that our MBDP method learns faster and has better asymptotic performance and sample efficiency than existing model-based algorithms.
  • Figure 3: Robustness performance is depicted as heat maps for various environment settings. Each heat map represents a set of experiments, and each square in the heat map represents the average return in one experiment. Colors closer to red (hotter) indicate a higher average return, meaning the algorithm trained better in that environment; cooler colors indicate the opposite. The four variants compared in the figure are no dropout ($\alpha=0, \beta=0$), rollout-dropout only ($\alpha$-dropout: $\alpha=0.2, \beta=0$), model-dropout only ($\beta$-dropout: $\alpha=0, \beta=0.2$), and both dropouts ($\alpha=0.2, \beta=0.2$). Each experiment in the Hopper environment stops after 300,000 steps, and each experiment in the HalfCheetah environment stops after 600,000 steps.
  • Figure 4: The horizontal axis shows the different values of $\alpha$ and $\beta$. The vertical axis shows the metric used to evaluate robustness or efficiency. The horizontal dashed line is the baseline case with both rollout-dropout and model-dropout removed ($\alpha=\beta=0$). Each experiment is trained for 120k steps in the Hopper environment and for 400k steps in the HalfCheetah environment. Each box plot corresponds to 10 different random seeds.
  • Figure 5: Illustrations of the four MuJoCo simulated robot environments used in our experiments.
  • ...and 1 more figure
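
The two dropout steps named in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rollout_dropout` and `model_dropout`, the use of per-rollout returns and per-model validation errors as ranking scores, and the rounding rule are all assumptions made for illustration. The sketch assumes that rollout-dropout discards the highest-return fraction $\alpha$ of rollouts (so that training is biased toward lower-reward events) and that model-dropout discards the fraction $\beta$ of ensemble members with the largest prediction error.

```python
import numpy as np


def _keep_lowest(scores, ratio):
    """Return sorted indices of the entries that remain after dropping
    the floor(ratio * n) entries with the largest scores."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 1 or scores.size == 0:
        raise ValueError("scores must be a non-empty 1-D sequence")
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must lie in [0, 1)")
    n_drop = int(np.floor(ratio * scores.size))
    n_keep = scores.size - n_drop
    order = np.argsort(scores, kind="stable")  # ascending scores
    return np.sort(order[:n_keep])


def rollout_dropout(returns, alpha):
    """Hypothetical rollout-dropout: drop the top-alpha fraction of
    rollouts ranked by return, keeping the lower-return ones."""
    return _keep_lowest(returns, alpha)


def model_dropout(errors, beta):
    """Hypothetical model-dropout: drop the top-beta fraction of
    ensemble members ranked by prediction error."""
    return _keep_lowest(errors, beta)


# Toy inputs (made up for illustration only).
rollout_returns = [1.0, 5.0, 3.0, 9.0, 7.0]
model_errors = [0.3, 0.1, 0.9, 0.2, 0.5]

kept_rollouts = rollout_dropout(rollout_returns, alpha=0.2)
kept_models = model_dropout(model_errors, beta=0.2)
print(kept_rollouts.tolist())  # indices of rollouts kept for policy training
print(kept_models.tolist())    # indices of models kept for generating rollouts
```

With $\alpha=\beta=0$ both functions keep every index, which corresponds to the no-dropout baseline in Figures 3 and 4. In a full pipeline, model-dropout would be applied before rollouts are generated and rollout-dropout afterwards, following the order described in the Figure 1 caption.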

Theorems & Definitions (11)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Proposition 3.4
  • proof
  • Lemma A.1
  • Lemma A.2
  • proof
  • proof
  • proof
  • ...and 1 more