The Three Regimes of Offline-to-Online Reinforcement Learning

Lu Li, Tianwei Ni, Yihao Sun, Pierre-Luc Bacon

TL;DR

The paper asks why offline-to-online RL produces inconsistent fine-tuning results, and answers with a stability–plasticity principle: preserve prior knowledge while still adapting to new data. It introduces a three-regime taxonomy (Superior, Comparable, Inferior) based on the relative performance of the pretrained policy $\pi_0$ versus the offline dataset $\mathcal{D}$, and validates the framework with a large-scale study across 63 settings, 45 of which match its predictions. The findings support regime-aligned design choices: $\pi_0$-centric strategies when the pretrained policy dominates, $\mathcal{D}$-centric strategies when the offline data dominates, and mixed approaches when the two are comparable. The work connects offline-to-online RL to the broader stability–plasticity literature and offers a principled, actionable lens for designing and analyzing fine-tuning methods.

Abstract

Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices for online fine-tuning that work well in one setting can fail completely in another. We propose a stability–plasticity principle that can explain this inconsistency: during online fine-tuning, we should preserve the knowledge of the pretrained policy or the offline dataset, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
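
The regime idea in the abstract can be illustrated with a small sketch. This is not the authors' procedure: the function name `classify_regime`, the tolerance `tol`, and the returned strategy labels are hypothetical, and the abstract does not state how "comparable" performance is decided. The sketch also assumes that "Superior" means the pretrained policy outperforms the offline dataset, and "Inferior" the reverse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegimeAdvice:
    regime: str    # "Superior", "Comparable", or "Inferior"
    strategy: str  # which source of prior knowledge to keep stable


def classify_regime(j_pi0: float, j_data: float, tol: float = 0.05) -> RegimeAdvice:
    """Pick a regime from the pretrained policy's return J(pi_0) and the
    offline dataset's return J(pi_D).

    `tol` is a hypothetical threshold (in the same units as the returns)
    below which the two are treated as comparable; it is an assumption of
    this sketch, not a value taken from the paper.
    """
    gap = j_pi0 - j_data
    if gap > tol:
        # Assumed meaning of "Superior": the pretrained policy is better.
        return RegimeAdvice("Superior", "pi0-centric")
    if gap < -tol:
        # Assumed meaning of "Inferior": the offline dataset is better.
        return RegimeAdvice("Inferior", "D-centric")
    # Neither source clearly dominates.
    return RegimeAdvice("Comparable", "mixed")


if __name__ == "__main__":
    # Made-up normalized returns, purely for illustration.
    for j_pi0, j_data in [(0.90, 0.50), (0.50, 0.52), (0.20, 0.70)]:
        print(j_pi0, j_data, classify_regime(j_pi0, j_data))
```

In practice the comparison would likely rest on returns estimated over several evaluation episodes and seeds rather than on a fixed threshold, so `tol` should be read as a placeholder for whatever significance criterion is appropriate.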

Paper Structure

This paper contains 27 sections, 10 equations, 9 figures, and 15 tables.

Figures (9)

  • Figure 1: Comparison between WSRL (pretrained policy only) and RLPD (offline dataset only) on two representative offline-to-online RL tasks. All learning curves are shown as mean $\pm$ 95% CI.
  • Figure 2: Overview of the three regimes in offline-to-online RL, defined based on the relative performance of the pretrained policy $J(\pi_0)$ and the offline dataset $J(\pi_\mathcal{D})$. For each regime, our framework indicates which property is most needed during fine-tuning. The boxes at the right show representative design choices that enhance stability or plasticity. Dashed arrows denote weaker connections than solid arrows.
  • Figure 3: Representative fine-tuning results in the Superior regime: the first row and first two subplots in the second row are correct predictions, while the remaining two show an adjacent mismatch and an opposite mismatch. Markers on the curves indicate the better-performing variant within $\pi_{0}$-centric methods and within $\mathcal{D}$-centric methods.
  • Figure 4: Representative results in the Inferior regime: the first six results are correct predictions, while the remaining two show an adjacent mismatch and an opposite mismatch.
  • Figure 5: Representative fine-tuning results for the Comparable regime. The first two subplots illustrate cases consistent with our framework’s predictions, while the latter two show mismatches with only small mean differences.
  • ...and 4 more figures