WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control

Mehran Aghabozorgi, Alireza Moazeni, Yanshu Zhang, Ke Li

TL;DR

WIMLE advances model-based reinforcement learning in two ways: it uses Implicit Maximum Likelihood Estimation to learn stochastic, multi-modal world models, and it weights synthetic rollouts according to predictive uncertainty. An ensemble of IMLE-based models provides per-transition uncertainty estimates $\sigma(s,a)$ that drive inverse-variance weighting of TD updates, preserving the Bellman target while reducing the impact of high-variance predictions. Theoretical results show that positive weights do not alter the Bellman fixed point and that inverse-variance weighting minimizes estimator covariance in linear settings, which supports faster, more stable learning. Empirically, WIMLE achieves superior sample efficiency and competitive asymptotic performance across 40 tasks in DMC, MyoSuite, and HumanoidBench, with notable gains on the challenging Humanoid-run and HumanoidBench tasks, demonstrating the practical value of multi-modality and uncertainty-aware training for robust model-based RL.
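
The inverse-variance weighting described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of ensemble disagreement averaged over state dimensions as the variance estimate, the `eps` stabilizer, and the normalization of weights to mean 1 are all assumptions made for illustration.

```python
import numpy as np

def inverse_variance_weights(ensemble_preds, eps=1e-6):
    """Hypothetical helper: turn ensemble disagreement into per-transition weights.

    ensemble_preds: array of shape (K, N, D) holding next-state predictions
    from K ensemble members for N synthetic transitions of dimension D.
    """
    # Variance across ensemble members, averaged over state dimensions -> (N,)
    var = ensemble_preds.var(axis=0).mean(axis=-1)
    # Inverse-variance weights; eps (an assumption) avoids division by zero
    w = 1.0 / (var + eps)
    # Normalize to mean 1 (an assumption) so the overall loss scale is unchanged
    return w / w.mean()

def weighted_td_loss(q_pred, td_target, weights):
    """Weighted squared TD error; the TD target itself is left untouched."""
    return float(np.mean(weights * (q_pred - td_target) ** 2))
```

Because every weight is strictly positive, the loss is zero exactly when `q_pred` equals `td_target` on every transition, which matches the claim that positive weights leave the Bellman fixed point unchanged while transitions with high disagreement contribute less to each update.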

Abstract

Model-based reinforcement learning promises strong sample efficiency but often underperforms in practice due to compounding model error, unimodal world models that average over multi-modal dynamics, and overconfident predictions that bias learning. We introduce WIMLE, a model-based method that extends Implicit Maximum Likelihood Estimation (IMLE) to the model-based RL framework to learn stochastic, multi-modal world models without iterative sampling and to estimate predictive uncertainty via ensembles and latent sampling. During training, WIMLE weights each synthetic transition by its predicted confidence, preserving useful model rollouts while attenuating bias from uncertain predictions and enabling stable learning. Across $40$ continuous-control tasks spanning DeepMind Control, MyoSuite, and HumanoidBench, WIMLE achieves superior sample efficiency and competitive or better asymptotic performance than strong model-free and model-based baselines. Notably, on the challenging Humanoid-run task, WIMLE improves sample efficiency by over $50$\% relative to the strongest competitor, and on HumanoidBench it solves $8$ of $14$ tasks (versus $4$ for BRO and $5$ for SimbaV2). These results highlight the value of IMLE-based multi-modality and uncertainty-aware weighting for stable model-based RL.
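
To make the IMLE idea in the abstract concrete, here is a minimal NumPy sketch of a conditional IMLE-style objective: for each observed transition, several latent codes are sampled, each is pushed through the generator, and only the candidate nearest to the observed next state contributes to the loss. The `generator` callable, `num_latents`, `latent_dim`, and the squared Euclidean distance are illustrative assumptions, not details taken from the paper; in practice the loss would be minimized with an autodiff framework rather than merely evaluated.

```python
import numpy as np

def imle_loss(generator, s, a, s_next, num_latents=8, latent_dim=4, rng=None):
    """Hypothetical IMLE-style loss for a conditional world model.

    generator(s, a, z) -> predicted next states of shape (N, D);
    s, a: batches of states and actions; s_next: observed next states (N, D).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = s.shape[0]
    # Sample num_latents latent codes per transition -> (m, N, latent_dim)
    z = rng.standard_normal((num_latents, n, latent_dim))
    # One candidate next state per latent code -> (m, N, D)
    candidates = np.stack([generator(s, a, z[j]) for j in range(num_latents)])
    # Squared distance from each candidate to the observed next state -> (m, N)
    sq_dist = ((candidates - s_next[None]) ** 2).sum(axis=-1)
    # For every transition, keep only the nearest candidate
    nearest = sq_dist.argmin(axis=0)
    loss = sq_dist[nearest, np.arange(n)].mean()
    return float(loss), nearest
```

Since each observed next state pulls only its closest sample, different latent codes remain free to cover different modes of the dynamics, and drawing a prediction at rollout time takes a single forward pass with one sampled latent.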

Paper Structure (46 sections, 46 equations, 14 figures, 10 tables, 3 algorithms)

Figures (14)

  • Figure 1: Sample efficiency on challenging tasks from each benchmark suite. WIMLE achieves superior sample efficiency and asymptotic performance over strong model-free and model-based baselines. Y-axes show interquartile mean. Shaded areas indicate 95% confidence intervals.
  • Figure 2: Wall-clock comparison among model-based methods (3 seeds) on a single NVIDIA L40S GPU for the humanoid-run task. Y-axis shows interquartile mean; shaded areas indicate 95% confidence intervals.
  • Figure 3: WIMLE world model architecture.
  • Figure 4: Aggregate results across benchmarks. WIMLE outperforms strong model-free and model-based baselines overall. Gains are most pronounced on the challenging Dog & Humanoid subset, where it achieves superior sample efficiency and asymptotic performance. On MyoSuite, it performs asymptotically on par with strong baselines that are already near the maximum score (1.0), and on HumanoidBench it significantly outperforms the baselines, solving $8/14$ tasks versus BRO $4$ and SimbaV2 $5$. Y-axes show interquartile mean; shaded areas denote $95\%$ confidence intervals.
  • Figure 5: Uncertainty-aware weighting reduces model bias and enables stable training at longer horizons on Humanoid-run. Left: Uncertainty-aware WIMLE compared to an unweighted variant that is identical except all per-transition weights are fixed to $w_i=1.0$ and a model-free variant that is identical except that it does not use the model; the unweighted curve lags and can even underperform the model-free variant early on, indicating that ignoring uncertainty will bias learning and hinder performance. Right: Rollout ablation ($H=1,4,6,8$) for WIMLE: increasing the model rollout horizon from $H{=}1$ to $H{=}4$ to $H{=}6$ improves performance, and extending to $H{=}8$ does not substantially degrade performance, suggesting that uncertainty-aware weighting mitigates harm from error accumulation at longer horizons. All variants use the same SAC backbone and distributional critics; only the ablated components differ. All plots are on DMC's Humanoid-run task with $5$ seeds.
  • ...and 9 more figures