Efficient Offline Reinforcement Learning: First Imitate, then Improve
Adam Jelley, Trevor McInroe, Sam Devlin, Amos Storkey
TL;DR
This work addresses the inefficiency and instability of pure offline TD learning with a two-stage pipeline. First, the actor is pretrained with behavior cloning and the critic is pretrained by supervised regression onto Monte-Carlo returns computed from the offline data. Second, both are refined with off-policy reinforcement learning. The authors provide a theoretical bound showing that better initialization reduces the number of fitted-Q iterations required, particularly for long-horizon tasks, and they extend the method to entropy-regularized (maximum-entropy) offline RL. Empirically, pretraining accelerates convergence and improves stability on D4RL MuJoCo benchmarks across multiple datasets, with LayerNorm further stabilizing training. The approach is a practical, minimally invasive recipe for improving offline RL performance that bridges imitation learning and TD-based optimization.
Abstract
Supervised imitation-based approaches are often favored over off-policy reinforcement learning approaches for learning policies offline, since their straightforward optimization objective makes them computationally efficient and stable to train. However, their performance is fundamentally limited by the behavior policy that collected the dataset. Off-policy reinforcement learning provides a promising approach for improving on the behavior policy, but training is often computationally inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach: pre-training with supervised learning before improving performance with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training an actor with behavior cloning and a critic with a supervised Monte-Carlo value error. We find that this substantially reduces the training time of popular off-policy algorithms on standard benchmarks and also improves stability. Code is available at: https://github.com/AdamJelley/EfficientOfflineRL
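To make the critic pre-training target concrete, the following is a minimal sketch of how discounted Monte-Carlo returns could be computed from a flat offline buffer and used as supervised regression targets. It is not the authors' implementation (see the linked repository for that). The function names `discounted_returns_to_go` and `mse_loss`, the flat `rewards`/`dones` layout, and the default `gamma` are assumptions made for illustration. The sketch also assumes every episode ends with a true terminal flag; episodes cut off by a time limit would need separate handling.

```python
from typing import List, Sequence


def discounted_returns_to_go(
    rewards: Sequence[float],
    dones: Sequence[bool],
    gamma: float = 0.99,
) -> List[float]:
    """Discounted return from each step to the end of its episode.

    `rewards` and `dones` are flat, time-ordered buffers that may contain
    several episodes back to back; `dones[t]` marks the last step of an
    episode (hypothetical layout, assumed for this sketch).
    """
    if len(rewards) != len(dones):
        raise ValueError("rewards and dones must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing carries over from the next episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mse_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predictions and targets.

    With critic outputs Q(s_t, a_t) as predictions and Monte-Carlo returns as
    targets, this is a supervised value error with no bootstrapping. The same
    form, applied to policy outputs and dataset actions, gives a simple
    behavior-cloning loss for a deterministic actor.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("cannot compute a loss over an empty batch")
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return sum(squared) / len(squared)
```

For example, with `rewards = [1, 1, 1]`, `dones = [False, False, True]` and `gamma = 0.5`, the targets are `[1.75, 1.5, 1.0]`. Because these targets are fixed numbers rather than bootstrapped estimates, the pre-training stage is an ordinary regression problem, which is the property the abstract relies on for efficiency and stability.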
