Model-Based Offline Planning

Arthur Argenson, Gabriel Dulac-Arnold

TL;DR

MBOP is a data-efficient, planning-based approach to offline reinforcement learning. It learns a world model, a behavior-cloning (BC) prior, and a fixed-horizon value function from logged data, then controls the system with MPC/MPPI-style planning. The method enables goal-conditioned and constraint-aware control without environment interaction and shows strong performance on RL-Unplugged and D4RL benchmarks, including near-optimal policies on certain simulated systems from as little as 50 seconds of system interaction and zero-shot adaptation to new tasks. Ablation studies indicate that the BC prior, value function, and learned dynamics each contribute to robust planning, and execution-speed analyses inform practical deployment. Overall, MBOP offers a configurable, interpretable alternative to model-free offline learning, applicable to robotics and real-world control systems.

Abstract

Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments. An accompanying video can be found here: https://youtu.be/nxGGHdZOFts
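To make the planning loop concrete, the following is a minimal sketch of MPPI-style planning with a model, a behavior-cloning prior, and a terminal value estimate. It is not the authors' implementation: the functions `dynamics`, `bc_prior`, and `value_fn` are hypothetical hand-written stand-ins for networks that would be learned from logged data, the 1-D task is a toy, and the parameters `horizon`, `n_samples`, `noise`, and `kappa` are illustrative assumptions.

```python
import math
import random

# Hypothetical stand-ins for the three learned components. In practice these
# would be models fit to the offline dataset, not hand-written functions.
def dynamics(state, action):
    """Toy 'world model' for a 1-D point: returns (next_state, reward)."""
    next_state = state + 0.1 * action
    reward = -abs(next_state - 1.0)  # higher reward closer to the target x = 1
    return next_state, reward

def bc_prior(state):
    """Toy behavior-cloning prior: a weak controller drifting toward x = 1."""
    return max(-1.0, min(1.0, 0.5 * (1.0 - state)))

def value_fn(state):
    """Toy estimate of the return beyond the planning horizon."""
    return -5.0 * abs(state - 1.0)

def plan(state, horizon=5, n_samples=64, noise=0.3, kappa=3.0, rng=None):
    """Return the first action of a return-weighted average of sampled plans."""
    rng = rng or random.Random(0)
    first_actions, returns = [], []
    for _ in range(n_samples):
        s, total, first = state, 0.0, None
        for _ in range(horizon):
            # Sample around the prior so rollouts stay near logged behavior.
            a = max(-1.0, min(1.0, bc_prior(s) + rng.gauss(0.0, noise)))
            if first is None:
                first = a
            s, r = dynamics(s, a)
            total += r
        total += value_fn(s)  # extend the finite horizon with the value estimate
        first_actions.append(first)
        returns.append(total)
    best = max(returns)  # subtract the max before exponentiating, for stability
    weights = [math.exp(kappa * (ret - best)) for ret in returns]
    return sum(w * a for w, a in zip(weights, first_actions)) / sum(weights)

def run_episode(steps=40, seed=0):
    """Replan at every step (MPC); the toy model doubles as the real system."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        state, _ = dynamics(state, plan(state, rng=rng))
    return state
```

Calling `run_episode()` drives the toy state from 0 toward the target at 1. Because the objective is only evaluated at plan time, the reward term inside `dynamics` could be swapped for another objective without retraining anything, which is the property that makes planning-based controllers easy to command externally.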

Paper Structure

This paper contains 21 sections, 1 equation, 9 figures, 11 tables, 2 algorithms.

Figures (9)

  • Figure 1: Performance of MBOP on various RLU and D4RL datasets. For each task we sub-sampled the original dataset to obtain the desired number of data points. The subsets are the same throughout the paper. The box plots show the quartiles of the dataset, with whiskers extending to the rest of the distribution and outliers plotted individually, using the standard Seaborn box plot (more info: https://seaborn.pydata.org/generated/seaborn.boxplot.html).
  • Figure 2: Performance of MBOP on constrained and goal-conditioned tasks. Fig. \ref{fig:cartpole_frames} illustrates a sequence of frames from the RLU Cartpole task with constrained and unconstrained MBOP controllers. In the constrained case, MBOP prevents the cart from crossing the middle of the rail (dotted red line) and confines it to one side. Fig. \ref{fig:cartpole_constrained} displays cart trajectories for constrained and unconstrained versions of the same controller. MBOP can maintain a performant policy (above $750$) while respecting these constraints. Fig. \ref{fig:quadruped_constrained} displays goal-conditioned performance on the RLU Quadruped. We ignore the original reward function and optimize directly for trajectories that maximize a particular velocity vector. Although influence from $f_B$ and $f_R$ biases the controller to maintain forward direction, we can still exert significant goal-directed influence on the policy.
  • Figure 3: Ablation results on multi-sized datasets from RLU and D4RL.
  • Figure 4: Performance of MBOP on D4RL tasks.
  • Figure 5: Effects of constraints on MBOP performance.
  • ...and 4 more figures
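The constrained and goal-conditioned behavior described in the Figure 2 caption can be pictured as changing how a planner scores candidate trajectories. The sketch below is one way this might look, not the paper's formulation: the function names, the per-step penalty, and the squared-distance goal objective are all assumptions chosen for illustration.

```python
def constrained_return(positions, rewards, limit=0.0, penalty=100.0):
    """Sum of rewards minus a fixed penalty for every step past the limit.

    Assumption: the constraint is 'position must not exceed limit', loosely
    mirroring a cart that must stay on one side of the rail.
    """
    violations = sum(1 for x in positions if x > limit)
    return sum(rewards) - penalty * violations

def goal_return(velocities, target):
    """Score a trajectory by closeness of each velocity to a target vector.

    Assumption: closeness is measured as negative squared distance; the
    original task reward is ignored entirely.
    """
    return -sum(
        sum((v - t) ** 2 for v, t in zip(velocity, target))
        for velocity in velocities
    )

# Two candidate rollouts with equal task reward; only one stays in bounds.
safe = constrained_return([-0.5, -0.2], [1.0, 1.0])
unsafe = constrained_return([-0.5, 0.2], [1.0, 1.0])
```

A planner that ranks or weights rollouts by `constrained_return` would prefer `safe` over `unsafe` even though both earn the same task reward, and ranking by `goal_return` would steer toward the requested velocity instead of the logged task.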