The Three Regimes of Offline-to-Online Reinforcement Learning
Lu Li, Tianwei Ni, Yihao Sun, Pierre-Luc Bacon
TL;DR
The paper asks why online fine-tuning in offline-to-online RL gives inconsistent results, and answers with a stability–plasticity principle: preserve prior knowledge while remaining able to adapt to new data. It introduces a three-regime taxonomy (Superior, Comparable, Inferior) based on the relative performance of the pretrained policy and the offline dataset, and validates the framework with a large-scale study across 63 settings. The findings favor regime-aligned design choices: pi0-centric strategies when the pretrained policy dominates, D-centric strategies when the offline data dominates, and mixed approaches when the two are comparable. The work connects offline-to-online RL to the broader stability–plasticity literature and offers a principled, actionable lens for designing and analyzing fine-tuning methods.
Abstract
Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices for online fine-tuning that work well in one setting can fail completely in another. We propose a stability–plasticity principle that can explain this inconsistency: during online fine-tuning, we should preserve the knowledge of the pretrained policy or the offline dataset, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
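As a rough illustration of the regime idea, the sketch below picks a regime by comparing two scalar performance estimates, one for the pretrained policy and one for the offline dataset. It is not the paper's procedure. The function names, the `tolerance` threshold, and the reading of "Superior" as "the pretrained policy outperforms the dataset" are assumptions made for this example.

```python
from enum import Enum


class Regime(Enum):
    SUPERIOR = "superior"      # assumed: pretrained policy outperforms the dataset
    COMPARABLE = "comparable"  # the two perform about the same
    INFERIOR = "inferior"      # assumed: dataset outperforms the pretrained policy


def classify_regime(policy_return: float,
                    dataset_return: float,
                    tolerance: float = 0.05) -> Regime:
    """Classify a setting by the gap between two performance estimates.

    `tolerance` is a hypothetical threshold for treating the two as
    comparable; the source does not specify how the boundary is drawn.
    """
    gap = policy_return - dataset_return
    if abs(gap) <= tolerance:
        return Regime.COMPARABLE
    return Regime.SUPERIOR if gap > 0 else Regime.INFERIOR


def stability_anchor(regime: Regime) -> str:
    """Return which source of prior knowledge to preserve in each regime."""
    if regime is Regime.SUPERIOR:
        return "pretrained policy"
    if regime is Regime.INFERIOR:
        return "offline dataset"
    return "mixed"
```

For example, with normalized returns of 0.9 for the pretrained policy and 0.5 for the dataset, `classify_regime(0.9, 0.5)` gives `Regime.SUPERIOR`, and `stability_anchor` then points to the pretrained policy as the knowledge to preserve.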
