Leveraging Offline Data in Online Reinforcement Learning

Andrew Wagenmaker, Aldo Pacchiano

TL;DR

This paper introduces FineTuneRL, an intermediate reinforcement learning setting where an offline dataset is augmented by online environment interaction to minimize online sample complexity in linear MDPs. It defines the Offline-to-Online Concentrability coefficient and proposes FTPedel, a Pedel-based algorithm that achieves ε-optimal policies with online samples close to the minimal online requirements implied by the coefficient, up to lower-order terms. The work also demonstrates that offline data can provably improve learning efficiency over either purely online or purely offline approaches and formalizes a verifiability framework distinguishing verifiable from unverifiable RL. Together, these results establish a principled foundation for leveraging offline data to accelerate online RL and open avenues for extending the theory to broader settings and objectives.

Abstract

Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment and must interact with it in order to find an $\epsilon$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset to learn from, but is unable to otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some set of offline data and, in addition, may also interact with the environment, how can we best use the offline data to minimize the number of online interactions necessary to learn an $\epsilon$-optimal policy? In this work, we consider this setting, which we call the FineTuneRL setting, for MDPs with linear structure. We characterize the necessary number of online samples needed in this setting given access to some offline dataset, and develop an algorithm, FTPedel, which is provably optimal, up to $H$ factors. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between verifiable learning, the typical setting considered in online RL, and unverifiable learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.

Paper Structure

This paper contains 43 sections, 37 theorems, 189 equations, and 5 algorithms.

Key Result

Proposition 1

Fix $\epsilon > 0$. Then there exists some choice of $\eta$ and set of parameter vectors $\mathcal{W}$ such that the set of linear softmax policies defined with $\eta$ and over the set $\mathcal{W}$, $\Pi_{\mathsf{lsm}}$, is guaranteed to contain an $\epsilon$-optimal policy on any linear MDP.
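To make the policy class in Proposition 1 concrete, here is a minimal sketch of a linear softmax policy at a single state. It assumes the common form $\pi(a \mid s) \propto \exp(\eta \langle \phi(s, a), w \rangle)$; the exact statement of Definition 3.2 is not reproduced on this page, so the function name, argument layout, and feature values below are illustrative assumptions rather than the paper's notation.

```python
import numpy as np


def linear_softmax_policy(phi_sa: np.ndarray, w: np.ndarray, eta: float) -> np.ndarray:
    """Action probabilities of a linear softmax policy at one state.

    phi_sa: (num_actions, d) array, one feature vector phi(s, a) per action.
    w:      (d,) parameter vector (an element of the set W in Proposition 1).
    eta:    positive temperature-like scalar; larger eta means greedier.

    Assumed form (hypothetical, see lead-in): pi(a | s) ∝ exp(eta * <phi(s, a), w>).
    """
    logits = eta * (phi_sa @ w)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Illustrative features for three actions in a 2-dimensional feature space.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
w = np.array([1.0, 0.0])

probs_soft = linear_softmax_policy(phi, w, eta=1.0)
probs_sharp = linear_softmax_policy(phi, w, eta=1000.0)
```

Under this assumed form, increasing `eta` concentrates the probability mass on the action maximizing $\langle \phi(s, a), w \rangle$, which gives some intuition for why a suitable choice of $\eta$ and $\mathcal{W}$ can yield a class containing an $\epsilon$-optimal policy.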

Theorems & Definitions (80)

  • Definition 3.1: Linear MDPs (Jin et al., 2020)
  • Definition 3.2: Linear Softmax Policy
  • Proposition 1: Lemma A.14 of Wagenmaker and Jamieson (2022)
  • Definition 4.1: Offline-to-Online Concentrability Coefficient
  • Definition 4.2: Minimal Online Samples for Coverage
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Remark 4.1: Comparison to Song et al. (2022)
  • Remark 4.2: Scaling of $T_{\mathsf{o2o}}^h$ and $\epsilon$ Dependence
  • ...and 70 more
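
For orientation, the linear MDP structure that Definition 3.1 refers to is usually stated as follows. This is the standard form from the linear MDP literature, written here as a sketch; the paper's own Definition 3.1 is not reproduced on this page and may differ in its normalization conditions or in how it treats random rewards. The symbols $\mu_h$ and $\theta_h$ are the conventional names and are assumptions with respect to this page.

```latex
% Standard linear MDP structure with a known feature map \phi : S \times A \to \mathbb{R}^d.
% For each step h \in [H] there exist d unknown signed measures \mu_h over S
% and an unknown vector \theta_h \in \mathbb{R}^d such that
\[
  P_h(\cdot \mid s, a) = \langle \phi(s, a), \mu_h(\cdot) \rangle,
  \qquad
  r_h(s, a) = \langle \phi(s, a), \theta_h \rangle,
  \qquad \forall (s, a) \in \mathcal{S} \times \mathcal{A}.
\]
```

Both the transition kernel and the reward are linear in the same $d$-dimensional features, which is why quantities such as feature covariance matrices of the offline and online data naturally enter coverage-style definitions like Definition 4.1.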