
Efficient Offline Reinforcement Learning: First Imitate, then Improve

Adam Jelley, Trevor McInroe, Sam Devlin, Amos Storkey

TL;DR

This work addresses the inefficiency and instability of purely temporal-difference (TD) based offline reinforcement learning with a two-stage pipeline. First, the actor is initialized via behavior cloning and the critic via Monte-Carlo value targets computed from the offline data; second, both are refined with off-policy reinforcement learning. The authors provide a theoretical bound showing that better initialization reduces the number of fitted-Q iterations required, particularly for long-horizon tasks, and they extend the method to entropy-regularized (maximum-entropy) offline RL. Empirically, pre-training accelerates convergence and improves stability on D4RL MuJoCo benchmarks across multiple datasets, with LayerNorm further stabilizing training. The approach offers a practical, minimally invasive recipe for improving offline RL performance, bridging imitation learning and TD-based optimization.

Abstract

Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to substantially improve the training time of popular off-policy algorithms on standard benchmarks, and also achieve greater stability. Code is available at: https://github.com/AdamJelley/EfficientOfflineRL
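To make the critic pre-training target concrete, the following is a minimal sketch of computing discounted Monte-Carlo returns for one logged trajectory and the squared error of critic predictions against them. It is an illustration under stated assumptions, not the authors' implementation: the function names `monte_carlo_returns` and `mc_value_error`, the default discount of 0.99, and the treatment of each trajectory as ending at its final logged step (no special handling of timeouts) are all assumptions made here.

```python
from typing import List, Sequence


def monte_carlo_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Discounted return-to-go G_t = r_t + gamma * G_{t+1} for one trajectory.

    Assumption: the trajectory ends at its last logged step, so G_T = r_T.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_value_error(q_predictions: Sequence[float], returns: Sequence[float]) -> float:
    """Mean squared error between critic predictions and Monte-Carlo returns."""
    if len(q_predictions) != len(returns) or len(returns) == 0:
        raise ValueError("predictions and returns must be non-empty and equal length")
    squared = [(q - g) ** 2 for q, g in zip(q_predictions, returns)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    targets = monte_carlo_returns(rewards, gamma=0.5)
    print(targets)  # [1.5, 1.0, 2.0]
    # An untrained critic that predicts zero everywhere has a large error.
    print(mc_value_error([0.0, 0.0, 0.0], targets))
```

Because these targets are fixed regression labels rather than bootstrapped estimates, fitting the critic to them is an ordinary supervised problem, which is the property the abstract credits for efficient and stable pre-training.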

Paper Structure

This paper contains 28 sections, 13 equations, 10 figures, 5 tables, and 2 algorithms.

Figures (10)

  • Figure 1: A motivational tabular MDP. In offline reinforcement learning, we are provided with a dataset of trajectories. In this paper we utilize information from the entire trajectory (in the form of MC returns) to initialize a critic for subsequent off-policy reinforcement learning, which eliminates much of the inefficiency and instability associated with bootstrapping in temporal difference losses.
  • Figure 2: Performance means with shaded 95% confidence intervals across 3 independent seeds. Supervised pre-training before offline reinforcement learning is more efficient than offline reinforcement learning from scratch. Surprisingly, performance is often more stable long after pre-training.
  • Figure 3: Investigation into the effect of adding LayerNorm to both the actor and critic networks for TD3+BC on HalfCheetah and Hopper-medium. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
  • Figure 4: The effect of adding LayerNorm to both actor+critic across environments. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
  • Figure 5: Ablation of actor and critic pre-training. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
  • ...and 5 more figures