Efficient Offline Reinforcement Learning: First Imitate, then Improve

Adam Jelley; Trevor McInroe; Sam Devlin; Amos Storkey

TL;DR

This work addresses the inefficiency and instability of pure offline TD learning with a two-stage pipeline. First, the actor is initialized via behavior cloning and the critic via Monte-Carlo value targets computed from the offline data; second, both are refined with off-policy reinforcement learning. The authors provide a theoretical bound showing that better initialization reduces the number of fitted-Q iterations required, particularly for long-horizon tasks, and they generalize the method to entropy-regularized (maximum-entropy) offline RL. Empirically, pretraining accelerates convergence and improves stability on D4RL MuJoCo benchmarks across multiple datasets, with LayerNorm further stabilizing training. The approach offers a practical, minimally invasive recipe for improving offline RL, combining the efficiency of imitation learning with the policy improvement of TD-based optimization.

Abstract

Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that we are able to substantially improve the training time of popular off-policy algorithms on standard benchmarks, and also achieve greater stability. Code is available at: https://github.com/AdamJelley/EfficientOfflineRL
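
To make the critic pre-training target concrete, the following is a minimal sketch of how discounted Monte-Carlo returns could be computed from one logged trajectory and compared against critic predictions with a squared error. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function names `discounted_returns` and `mc_value_error` are hypothetical, the trajectory is assumed to end in a terminal state, and the discount factor is passed in by the caller.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Compute the return-to-go G_t = r_t + gamma * G_{t+1} for one trajectory.

    Assumption: the trajectory ends in a terminal state, so the return after
    the final step is zero. Truncated episodes would need separate handling.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_value_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between critic predictions and Monte-Carlo returns."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one target is required")
    squared = [(p - g) ** 2 for p, g in zip(predictions, targets)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Toy three-step trajectory; the rewards are made up for illustration.
    targets = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
    print(targets)
    # A critic that predicts zero everywhere, as a stand-in for an untrained network.
    print(mc_value_error([0.0, 0.0, 0.0], targets))
```

Because these targets are fixed quantities computed once from the dataset, fitting them is an ordinary regression problem with no bootstrapping, which is the property the abstract relies on when it describes the pre-training stage as supervised.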

Paper Structure (28 sections, 13 equations, 10 figures, 5 tables, 2 algorithms)

Introduction
Preliminaries and Related Work
Motivational Example
Theoretical Analysis
Pretraining Off-Policy Reinforcement Learning Algorithms in Practice
Outline Procedure
Bias-Variance and Optimism-Pessimism Trade-offs
Generalization to Maximum Entropy Off-Policy RL Algorithms
Experiments on D4RL MuJoCo
Implementation Details
Results and Analysis
Conclusion
Rationale for Separation of Actor and Critic Pre-training for Entropy-Regularized Reinforcement Learning
Investigation Into Effect of LayerNorm
Ablations of Actor and Critic Pre-Training
...and 13 more sections

Figures (10)

Figure 1: A motivational tabular MDP. In offline reinforcement learning, we are provided with a dataset of trajectories. In this paper we utilize information from the entire trajectory (in the form of MC returns) to initialize a critic for subsequent off-policy reinforcement learning, which eliminates much of the inefficiency and instability associated with bootstrapping in temporal difference losses.
Figure 2: Performance means with shaded 95% confidence intervals across 3 independent seeds. Supervised pre-training before offline reinforcement learning is more efficient than offline reinforcement learning from scratch. Surprisingly, performance is often more stable long after pre-training.
Figure 3: Investigation into the effect of adding LayerNorm to both the actor and critic networks for TD3+BC on HalfCheetah and Hopper-medium. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
Figure 4: The effect of adding LayerNorm to both actor+critic across environments. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
Figure 5: Ablation of actor and critic pre-training. All lines show mean and standard deviation in normalized return at each timestep over 3 seeds.
...and 5 more figures
