Hybrid Reinforcement Learning from Offline Observation Alone

Yuda Song, J. Andrew Bagnell, Aarti Singh

TL;DR

The paper studies hybrid RL when the offline data contains only observations (states), a setting it calls HyRLO, and distinguishes trace-model access from reset-model access. It formalizes admissibility as the condition that the offline state distributions can be realized by some policy in the policy class, and develops a two-phase algorithm, Foobar, that combines forward state-moment matching (FAIL) with backward PSDP-trace refinement to compete with the policies covered by the offline data. Theoretical guarantees are provided under admissibility and Bellman-completeness, with Foobar achieving competitive regret bounds and sample complexities that scale with the offline coverage and horizon, without relying on bilinear MDP structure. Empirically, Foobar attains strong performance on combination-lock tasks and high-dimensional hammer manipulation, while remaining robust to certain inadmissible offline distributions. Overall, the work makes HyRLO more practical by enabling decision-making from state-only offline data combined with online interaction.
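The idea of state-moment matching mentioned above can be illustrated with a toy computation: compare the states visited by a policy with the offline states through a small class of discriminator functions, and report the largest gap in their means. This is only a sketch under stated assumptions, not the paper's FAIL implementation; the function name `moment_gap`, the two discriminators, and the sample arrays are all hypothetical.

```python
import numpy as np

def moment_gap(policy_states, offline_states, discriminators):
    """Largest absolute difference in discriminator means between two state samples."""
    gaps = [
        abs(np.mean(f(policy_states)) - np.mean(f(offline_states)))
        for f in discriminators
    ]
    return max(gaps)

# Hypothetical discriminator class: first and second moments of a scalar state.
discriminators = [lambda s: s, lambda s: s ** 2]

# Hypothetical state samples at a single time step h.
offline = np.array([0.0, 1.0, 1.0, 2.0])
policy_a = np.array([0.0, 1.0, 1.0, 2.0])  # same empirical state distribution
policy_b = np.array([2.0, 2.0, 2.0, 2.0])  # concentrates on a single state

gap_a = moment_gap(policy_a, offline, discriminators)
gap_b = moment_gap(policy_b, offline, discriminators)
```

In this toy setup a forward pass would prefer `policy_a`, whose gap is zero, over `policy_b`. A richer discriminator class would separate more distributions at the cost of more samples.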

Abstract

We consider the hybrid reinforcement learning setting where the agent has access to both offline data and online interactive access. While Reinforcement Learning (RL) research typically assumes offline data contains complete action, reward and transition information, datasets with only state information (also known as observation-only datasets) are more general, abundant and practical. This motivates our study of the hybrid RL with observation-only offline dataset framework. While the task of competing with the best policy "covered" by the offline data can be solved if a reset model of the environment is provided (i.e., one that can be reset to any state), we show evidence of hardness when only given the weaker trace model (i.e., one can only reset to the initial states and must produce full traces through the environment), without further assumption of admissibility of the offline data. Under the admissibility assumptions -- that the offline data could actually be produced by the policy class we consider -- we propose the first algorithm in the trace model setting that provably matches the performance of algorithms that leverage a reset model. We also perform proof-of-concept experiments that suggest the effectiveness of our algorithm in practice.
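The difference between the two access models described in the abstract can be made concrete with a toy interface. The sketch below is an illustration only: the classes `ChainEnv` and `ResetModel`, the method `reset_to`, and the helper `rollout_trace` are hypothetical names, and the deterministic chain environment is an assumption made for brevity.

```python
class ChainEnv:
    """Trace-model access: episodes can only start from the initial state."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.state = 0

    def reset(self):
        # The only reset available in the trace model: back to the initial state.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one state forward; any other action stays in place.
        if action == 1:
            self.state = min(self.state + 1, self.horizon)
        return self.state


class ResetModel(ChainEnv):
    """Reset-model access: the environment can additionally be set to any state."""

    def reset_to(self, state):
        self.state = state
        return self.state


def rollout_trace(env, policy):
    """Produce a full trace from the initial state, as the trace model requires."""
    state = env.reset()
    trace = [state]
    for h in range(env.horizon):
        state = env.step(policy(h, state))
        trace.append(state)
    return trace


trace = rollout_trace(ChainEnv(horizon=3), lambda h, s: 1)  # [0, 1, 2, 3]

reset_env = ResetModel(horizon=3)
start = reset_env.reset_to(2)   # jump directly to an offline state
after = reset_env.step(1)       # continue from there
```

With `ResetModel`, a learner can start rollouts directly from states seen in the offline data. With `ChainEnv` alone, it must first find a policy that reaches those states, which is the role admissibility plays in the trace-model setting.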

Paper Structure (41 sections, 17 theorems, 55 equations, 7 figures, 6 tables, 7 algorithms)

Key Result

Proposition 1

For any algorithm $\mathsf{Alg}$, denote the dataset collected by $\mathsf{Alg}$ as $D^{\mathsf{Alg}}$, and let $\widehat{D}$ denote the empirical distribution of a dataset $D$. Then there exists an MDP $\mathcal{M}$ with deterministic transition and a set of offline datasets $\{\mathcal{D}_h\}$, [...] However, there exists an algorithm $\mathsf{Alg}^{\mathsf{reset}}$ that uses any offline dataset $D$ [...]

Figures (7)

  • Figure 1: Comparison with hybrid RL and online RL. Left: evaluation curve along the training process in the combination lock task. The plot for Foobar combines the forward and backward passes: during the forward pass, the evaluation result is from all the forward policies (trained and untrained). During the backward pass, after training at horizon $h$, the evaluation is from the policy $\pi^\mathsf{f} \circ_h \pi^\mathsf{b}$. Right: evaluation curve along the training process in the hammer-binary task. The plot for Foobar shows the performance of the stationary backward policy in the backward phase. We repeat the experiment for 10 random seeds and plot the median and 25% to 75% percentiles.
  • Figure 2: Construction for \ref{prop:hard_data}. The blue nodes correspond to the offline data's coverage of the optimal policy. The orange node corresponds to the inadmissible part of the offline data.
  • Figure 3: Construction for \ref{prop:hard_tv}. The blue transition corresponds to the dynamics after taking the action $a_1$, and the orange transition corresponds to the dynamics after taking the action $a_2$. The red node denotes the node with rewards.
  • Figure 4: Visualization of the environment. Left: combination lock. Right: hammer. The left figure is reproduced from zhang2022efficient with permission from the authors.
  • Figure 5: Zoomed-in training curve of Foobar.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Proposition 1
  • Proposition 2
  • Theorem 1: Guarantee of \ref{alg:fail}
  • Theorem 2
  • Remark 1: Reduction from trace to reset
  • Remark 2: Removing explicit structural assumptions
  • Remark 3: Significance of the discriminator class
  • Proposition 3
  • Proposition 4
  • Lemma 1: Guarantee of \ref{alg:minmax} (Theorem 3.1 of sun2019provably)
  • ...and 12 more