Model-based Lookahead Reinforcement Learning

Zhang-Wei Hong, Joni Pajarinen, Jan Peters

TL;DR

This work addresses the data-efficiency gap between model-free and model-based RL by unifying their strengths in a Model Predictive Control with Model-Free RL (MPC-MFRL) framework. It jointly learns a policy, a value function, and a forward dynamics model, and within MPC it uses policy-guided trajectory sampling and value-based trajectory evaluation, supplemented by soft-greedy action selection. Empirical results on MuJoCo tasks show that MPC-MFRL achieves state-of-the-art data efficiency while matching or surpassing model-free performance, particularly on challenging tasks such as Ant and HalfCheetah. The approach shows robust improvements from policy-informed exploration and highlights the practical potential of data-efficient planning in robotics and complex control problems.

Abstract

Model-based Reinforcement Learning (MBRL) allows data-efficient learning, which is required in real-world applications such as robotics. However, despite its impressive data efficiency, MBRL does not achieve the final performance of state-of-the-art Model-free Reinforcement Learning (MFRL) methods. We leverage the strengths of both realms and propose an approach that obtains high performance with a small amount of data. In particular, we combine MFRL and Model Predictive Control (MPC). While MFRL's strength in exploration allows us to train a better forward dynamics model for MPC, MPC improves the performance of the MFRL policy through sampling-based planning. The experimental results on standard continuous control benchmarks show that our approach can achieve MFRL's level of performance while being as data-efficient as MBRL.
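The planning loop described above (sample trajectories with the policy, roll them out through the learned forward dynamics model, score them with the value function, then pick an action soft-greedily) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `plan_action`, the callables `policy`, `dynamics`, `reward`, and `value`, the discount `gamma`, the `temperature` parameter, and the choice of a softmax over trajectory returns for the soft-greedy step are all hypothetical stand-ins. The paper's own action-selection equation is not reproduced on this page.

```python
import numpy as np

def plan_action(state, policy, dynamics, reward, value, rng,
                num_traj=64, horizon=5, gamma=0.99, temperature=1.0):
    """Pick one action via policy-guided, sampling-based MPC (illustrative sketch)."""
    first_actions = []
    returns = np.zeros(num_traj)
    for k in range(num_traj):
        s = state
        discount = 1.0
        for h in range(horizon):
            a = policy(s, rng)              # policy-guided trajectory sampling
            if h == 0:
                first_actions.append(a)     # only the first action is ever executed
            returns[k] += discount * reward(s, a)
            s = dynamics(s, a)              # learned forward model predicts the next state
            discount *= gamma
        returns[k] += discount * value(s)   # value function scores the final state
    # Assumed soft-greedy rule: sample a trajectory index from a softmax over returns.
    logits = (returns - returns.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    idx = rng.choice(num_traj, p=probs)
    return first_actions[idx], returns, probs

# Toy 1-D usage with hypothetical stand-ins for the learned components.
rng = np.random.default_rng(0)
action, returns, probs = plan_action(
    1.0,
    policy=lambda s, rng: rng.normal(-0.5 * s, 0.1),
    dynamics=lambda s, a: s + a,
    reward=lambda s, a: -float(s ** 2),
    value=lambda s: -float(s ** 2),
    rng=rng, num_traj=16, horizon=3,
)
```

In a real agent this function would be called once per environment step, so planning restarts from each newly observed state. Lowering `temperature` moves the selection toward a purely greedy choice of the best sampled trajectory.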

Paper Structure

This paper contains 24 sections, 7 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Overview of MPC-MFRL at evaluation time: In state $\boldsymbol{s}_t$, MPC-MFRL samples trajectories using an MFRL policy, evaluates sampled trajectories by an MFRL value function, and then chooses an action $\boldsymbol{a}_t$ based on Eq. \ref{eq::act_select}. The environment transitions to $\boldsymbol{s}_{t+1}$ and the process starts from the beginning. The upper row illustrates planning in simulation. The lower row depicts interaction with the real environment.
  • Figure 2: Mean and bootstrapped confidence interval (solid lines and error bars, over 5 distinct random seeds) of "Average return" (see Section \ref{sec:evaluation_procedure} for evaluation details and the definition of "Average return") for different methods. "Num. timestep (M)" is the number of interactions with the environment, in millions. Our method MPC-MFRL outperforms the comparison methods. For the comparison methods and evaluation details, see Section \ref{subsec::exp_setup}.
  • Figure 3: (a) We measure the "Average testing error" of a forward dynamics model using a pre-collected testing dataset. Policy indicates collecting data using an MFRL policy, while Random+MPC represents uniform random exploration with on-policy data aggregation \cite{nagabandi2017neural}; (b) The evaluation results of MPC-MFRL with different training schemes: MPC-MFRL (Policy) is the original MPC-MFRL, while MPC-MFRL (Random+MPC) trains the forward dynamics model using data from Random+MPC. The remaining legends are the same as in Fig. \ref{fig::overall_perf}.
  • Figure 4: (a) Varying trajectory sampling methods: MPC-MFRL ($\mathcal{Z} = \pi$) and MPC-MFRL ($\mathcal{Z} = U$) respectively denote the original MPC-MFRL and an MPC-MFRL variant that replaces the MFRL policy with a uniform distribution for trajectory sampling. See Fig. \ref{fig::overall_perf} for details on the notation in the figure; (b) The evaluation results of different trajectory evaluation methods and planning horizons: MPC-MFRL $(R_\phi(s_t) = V(s_t), H = 2)$, for instance, indicates MPC-MFRL with $R_\phi(s_t) = V(s_t)$ and planning horizon $H = 2$. See Fig. \ref{fig::overall_perf} for more details on figure notation.
  • Figure 5: (a) The evaluation results of different action selection approaches: MPC-MFRL (w/ SG) indicates the original MPC-MFRL, while MPC-MFRL (w/o SG) represents MPC-MFRL without soft-greedy action selection. The rest of the legends are identical to Fig. \ref{fig::overall_perf}; (b) Mean and bootstrapped confidence interval (bold bars and error bars, over 5 distinct random seeds) of performance for action selection approaches with different forward dynamics model complexities. "Num. hidden units" denotes the number of hidden units used. For evaluation details, see Section \ref{sec:evaluation_procedure}.
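Several captions above report a mean with a bootstrapped confidence interval over 5 random seeds. As a rough illustration of how such an interval can be computed, the sketch below uses a percentile bootstrap of the mean. The captions do not say which bootstrap variant, confidence level, or number of resamples the authors used, so the function name `bootstrap_mean_ci`, its defaults, and the example returns are all hypothetical.

```python
import numpy as np

def bootstrap_mean_ci(values, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the per-seed results with replacement, one row per resample.
    idx = rng.integers(0, len(values), size=(num_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper

# Hypothetical average returns from 5 seeds (made-up numbers, not results from the paper).
mean, lower, upper = bootstrap_mean_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

With only 5 seeds the resampled means take few distinct values, so intervals computed this way are coarse and best read as a rough indication of spread.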