Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs

Kevin Tan, Wei Fan, Yuting Wei

TL;DR

This paper addresses sample efficiency in hybrid RL without relying on single-policy concentrability by studying linear MDPs and proposing two algorithms: RAPPEL (online-to-offline reward-agnostic exploration followed by offline pessimistic RL) and HYRULE (offline-to-online warm-started online RL). It introduces a partitioned analysis into online and offline components with dimensions $d_{\text{on}}$ and $d_{\text{off}}$ and partial concentrability coefficients $c_{\text{off}}(\mathcal{X}_{\text{off}})$ and $c_{\text{on}}(\mathcal{X}_{\text{on}})$, achieving $H^3$ horizon scaling and improved dimension dependence. The results show that hybrid RL can match or exceed offline-only and online-only rates in linear MDPs, providing the strongest theoretical guarantees to date for this setting, including a reward-agnostic exploration strategy that yields fixed policies and robust offline data for multiple rewards. Numerical experiments on a Tetris-like environment illustrate practical gains in both the offline-to-online and online-to-offline regimes, highlighting the potential of reward-agnostic exploration and warm-starting to improve sample efficiency in real-world tasks.

Abstract

Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assumption. While Li et al. (2023) provided an affirmative answer to this question in the tabular PAC RL case, the question remains unsettled for both the regret-minimizing RL case and the non-tabular case. In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both PAC and regret-minimizing RL with linear function approximation, without single-policy concentrability. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for PAC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs.
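
The intuition for why pooling offline and online data can help in a linear MDP can be sketched numerically. The snippet below is not RAPPEL or HYRULE; it only computes the standard elliptical uncertainty term $\sqrt{\phi^\top \Lambda^{-1} \phi}$ used in least-squares value iteration for linear MDPs. The feature dimension, the synthetic datasets, and the helper name `elliptical_bonus` are all assumptions made for illustration: the offline data is assumed to cover only the first two feature directions, and the online data is assumed to explore the remaining two.

```python
import numpy as np


def elliptical_bonus(features, query, ridge=1.0):
    """Return sqrt(query^T Lambda^{-1} query), where
    Lambda = ridge * I + sum_i phi_i phi_i^T over the rows of `features`."""
    d = query.shape[0]
    cov = ridge * np.eye(d) + features.T @ features
    return float(np.sqrt(query @ np.linalg.solve(cov, query)))


rng = np.random.default_rng(0)
d = 4  # hypothetical feature dimension

# Hypothetical offline dataset: covers only the first two feature directions.
offline = np.zeros((200, d))
offline[:, :2] = rng.normal(size=(200, 2))

# Hypothetical online exploration data: covers the remaining two directions.
online = np.zeros((200, d))
online[:, 2:] = rng.normal(size=(200, 2))

# Hybrid dataset: simply pool both sources.
hybrid = np.vstack([offline, online])

# A query direction that the offline data never visits.
query = np.array([0.0, 0.0, 1.0, 0.0])

bonus_offline = elliptical_bonus(offline, query)
bonus_hybrid = elliptical_bonus(hybrid, query)

print(f"uncertainty with offline data only: {bonus_offline:.3f}")
print(f"uncertainty with hybrid data:       {bonus_hybrid:.3f}")
```

With offline data alone, the uncertainty in the uncovered direction stays at the level set by the ridge term, whereas the pooled data shrinks it. This mirrors, in a simplified form, the idea of handling an offline-covered part and an online-explored part of the feature space separately.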

Paper Structure

This paper contains 40 sections, 24 theorems, 138 equations, 3 figures, 2 tables, 5 algorithms.

Key Result

Lemma 1

For any partition ${\mathcal{X}}_{\operatorname{off}}, {\mathcal{X}}_{\operatorname{on}}$, we have $c_{\operatorname{on}}({\mathcal{X}}_{\operatorname{on}}) \leq d_{\operatorname{on}}$. Also, there exists at least one partition such that $c_{\operatorname{off}}({\mathcal{X}}_{\operatorname{off}})$ …

Figures (3)

  • Figure 1: Coverage achieved by OPTCOV with 200 trajectories of offline data collected under a uniform and an adversarial behavior policy, and with no offline data. Results averaged over $30$ trials, with the shaded area depicting $1.96$-standard errors. Lower is better.
  • Figure 2: Value of policies learned by applying LinPEVI-ADV to the hybrid, offline, and online datasets, with an adversarial behavior policy. The reward is negative as it is the negative of the excess height. Results over $30$ trials. Higher is better.
  • Figure 3: Comparison of LSVI-UCB++ and HYRULE. Results averaged over $10$ trials, with error bars showing $1$ standard deviation.

Theorems & Definitions (41)

  • Definition 1: Occupancy Measure
  • Definition 2: Concentrability Coefficient
  • Lemma 1: Partial Coverability Is Bounded In Linear MDPs
  • Theorem 1: Error Bound for RAPPEL
  • Corollary 1
  • Theorem 2: Regret Bound for HYRULE
  • Corollary 2
  • Lemma 2: General Statistical Guarantee for RAPPEL
  • proof
  • Lemma 3: First Error Bound for RAPPEL
  • ...and 31 more