Acting upon Imagination: when to trust imagined trajectories in model based reinforcement learning

Adrian Remonda, Eduardo Veas, Granit Luzhnica

TL;DR

This work tackles the high computational cost of replanning in model-based reinforcement learning with imagined trajectories. It introduces online uncertainty estimation methods that decide when to trust and continue with an existing plan versus replan, including N-Skip, First Step Alike, Confidence Bounds, FUT, and BICHO. Across pre-trained and online-updating dynamics, the methods substantially reduce planning computations (often by 20–80%) with little to no loss in reward, validated on MuJoCo tasks with a shooting MPC framework. The findings demonstrate that forward-propagation–based uncertainty checks (FUT, BICHO) generally offer the best trade-offs, enabling efficient, reliable decision-making in complex environments.
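To make the replanning-rate figures concrete, the simplest of these schemes, N-Skip, can be sketched as a fixed schedule. This is a minimal illustration, not the authors' implementation: it assumes N-Skip means reusing the next n actions of the current plan after each planner call, and the function names `n_skip_schedule` and `replanning_rate` are hypothetical.

```python
from typing import List


def n_skip_schedule(num_steps: int, n_skip: int) -> List[bool]:
    """Return, for each environment step, whether the planner is called.

    Assumption (not from the paper): after each planner call, the next
    `n_skip` actions of that plan are executed without replanning.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if n_skip < 0:
        raise ValueError("n_skip must be non-negative")
    # The planner runs on step 0 and then every (n_skip + 1) steps.
    return [t % (n_skip + 1) == 0 for t in range(num_steps)]


def replanning_rate(schedule: List[bool]) -> float:
    """Fraction of environment steps on which the planner was called."""
    return sum(schedule) / len(schedule)
```

Under this assumption, `n_skip=0` replans on every step, while `n_skip=4` replans on one step in five. The adaptive methods differ in that they decide per step, from observed or predicted error, rather than on a fixed schedule.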

Abstract

Model-based reinforcement learning (MBRL) aims to learn model(s) of the environment dynamics that can predict the outcome of an agent's actions. Forward application of the model yields so-called imagined trajectories (sequences of actions and predicted state-reward pairs), which are used to optimize the set of candidate actions that maximize expected reward. The outcome, an ideal imagined trajectory or plan, is imperfect, and MBRL typically relies on model predictive control (MPC) to overcome this by continuously replanning from scratch, thus incurring major computational cost and increasing complexity in tasks with a longer receding horizon. We propose uncertainty estimation methods for online evaluation of imagined trajectories to assess whether further planned actions can be trusted to deliver acceptable reward. These methods include comparing the error after performing the last action with the standard expected error and using model uncertainty to assess the deviation from expected outcomes. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate whether the remainder of the plan aligns with expected results and to assess the remainder of the plan in terms of the expected reward. Our experiments demonstrate the effectiveness of the proposed uncertainty estimation methods by applying them to avoid unnecessary trajectory replanning in a shooting MBRL setting. Results highlight a significant reduction in computational cost without sacrificing performance.
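The first idea in the abstract, comparing the error after the last action with an expected error, can be sketched as a control loop that replans only when the observed state strays too far from the imagined one. This is a hedged toy illustration under stated assumptions, not the paper's algorithm: the names `should_replan` and `run_episode`, the Euclidean error, and the `tolerance` multiplier are all hypothetical choices, and `plan` stands in for any shooting planner that returns actions together with the states it imagined.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Plan = Tuple[List[float], List[State]]  # (actions, imagined next states)


def should_replan(predicted: Sequence[float], observed: Sequence[float],
                  expected_error: float, tolerance: float = 1.0) -> bool:
    """True if the one-step prediction error exceeds the expected error.

    Assumption: error is the Euclidean distance between states.
    """
    error = sum((p - o) ** 2 for p, o in zip(predicted, observed)) ** 0.5
    return error > tolerance * expected_error


def run_episode(env_step: Callable[[State, float], State],
                plan: Callable[[State], Plan],
                start_state: State, num_steps: int,
                expected_error: float) -> Tuple[State, int]:
    """Act on the imagined plan; replan only when it is no longer trusted."""
    state = start_state
    actions: List[float] = []
    predictions: List[State] = []
    idx, planner_calls, need_plan = 0, 0, True
    for _ in range(num_steps):
        if need_plan:
            actions, predictions = plan(state)
            idx, planner_calls = 0, planner_calls + 1
        action, predicted = actions[idx], predictions[idx]
        state = env_step(state, action)  # act in the real environment
        idx += 1
        # Replan if the plan is exhausted or the last prediction was off.
        need_plan = idx >= len(actions) or should_replan(
            predicted, state, expected_error)
    return state, planner_calls


def toy_plan(state: State, horizon: int = 5) -> Plan:
    """Hypothetical planner: always push by +1 and imagine s' = s + a."""
    actions = [1.0] * horizon
    predictions, s = [], state[0]
    for a in actions:
        s += a
        predictions.append((s,))
    return actions, predictions
```

With an environment that matches the toy model, the loop replans only when a plan runs out; with a mismatched environment, every step triggers replanning, which recovers standard MPC behaviour.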

Paper Structure

This paper contains 23 sections, 7 figures, 6 tables, 7 algorithms.

Figures (7)

  • Figure 1: A running example of the proposed FUT method with task horizon H=5, showing a replanning event at time step t=4. The states obtained from the environment are shown in black. The red line represents the imagined trajectory at state $s_t$, while the green, blue and orange lines show the projected outcomes from states $s_{t+1}$, $s_{t+2}$ and $s_{t+3}$, respectively. Replanning was triggered at $s_{t+4}$ because the projected trajectory shown in orange deviates from the expected outcome. The FSA and CB methods observe the outcome of the last action in the trajectory. FSA compares the error after performing the last action with the standard expected error, while CB assesses the deviation from expected outcomes using model uncertainty ($\epsilon$ in the figure). FUT and BICHO exploit the forward propagation of the dynamics model. FUT evaluates whether the remainder of the plan aligns with expected results, and BICHO assesses the remainder of the plan in terms of the expected reward.
  • Figure 2: Left: Trajectory errors of predicted future steps, along with the minimal error at each step and the average error at the first step. Right: Episode reward of CP as a function of environment steps. The agent's task is to hold the pole up; reward ranges from +1 (pole is up) to 0 (pole is down).
  • Figure 3: Acting upon imagination with pre-trained dynamics. Average maximum reward as a function of replanning rate in the CP environment, over 10 runs, for the n-skip, FSA, CB and FUT20 methods. Baseline(n=0), which is PETs, represents 100% replanning. SAC and PPO at convergence are drawn as dotted lines across the x-axis, for reference only.
  • Figure 4: Performance while training the dynamics model. Episode reward as a function of relative wall time. The vertical lines represent the end of training for the given method. Baseline(n=0) is PETs and represents 100% recalculation.
  • Figure 5: Trajectory error as a function of predicted future steps for all environments.
  • ...and 2 more figures