Hybrid Transfer Reinforcement Learning: Provable Sample Efficiency from Shifted-Dynamics Data

Chengrui Qu, Laixi Shi, Kishan Panaganti, Pengcheng You, Adam Wierman

TL;DR

This work proposes a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics, and designs HySRL, a transfer algorithm that achieves problem-dependent sample complexity and outperforms pure online RL.

Abstract

Online reinforcement learning (RL) typically requires high-stakes online interaction data to learn a policy for a target task. This prompts interest in leveraging historical data to improve sample efficiency. The historical data may come from outdated or related source environments with different dynamics. It remains unclear how to effectively use such data in the target task to provably enhance learning and sample efficiency. To address this, we propose a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics. We show that -- without information on the dynamics shift -- general shifted-dynamics data, even with subtle shifts, does not reduce sample complexity in the target environment. However, with prior information on the degree of the dynamics shift, we design HySRL, a transfer algorithm that achieves problem-dependent sample complexity and outperforms pure online RL. Finally, our experimental results demonstrate that HySRL surpasses a state-of-the-art online RL baseline.

Paper Structure

This paper contains 61 sections, 21 theorems, 159 equations, 4 figures, 3 algorithms.

Key Result

Theorem 1

Given an optimality gap $\varepsilon$, consider for any $\mathcal{M}_{\mathrm{src}}$ the following set of possible MDPs: where $48\varepsilon/H^2\le \alpha\le 1$. Suppose $S\ge 3$, $H\ge 3$, $A\ge 2$, $\varepsilon\le 1/48$. For any algorithm, there always exists a $\mathcal{M}_{\mathrm{src}}$ and a target MDP $\mathcal{M}_{\mathrm{tar}}\in \mathcal{M}_{\alpha}$, if the number of samples $n$ colle
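The parameter conditions quoted in Theorem 1 can be checked mechanically. The sketch below is illustrative only: the function names `in_lower_bound_regime` and `min_alpha` are hypothetical, and the requirement that $\varepsilon$ be strictly positive is an assumption not stated in the excerpt. It encodes only the inequalities visible above ($S\ge 3$, $H\ge 3$, $A\ge 2$, $\varepsilon\le 1/48$, $48\varepsilon/H^2\le\alpha\le 1$), not the truncated sample-size condition.

```python
def min_alpha(H: int, eps: float) -> float:
    """Smallest shift level alpha allowed by the condition 48*eps/H^2 <= alpha."""
    return 48 * eps / H**2


def in_lower_bound_regime(S: int, A: int, H: int, eps: float, alpha: float) -> bool:
    """Check the parameter conditions quoted in Theorem 1.

    Assumption: eps > 0 (an optimality gap of zero is not meaningful here).
    """
    sizes_ok = S >= 3 and H >= 3 and A >= 2
    eps_ok = 0 < eps <= 1 / 48
    alpha_ok = min_alpha(H, eps) <= alpha <= 1
    return sizes_ok and eps_ok and alpha_ok


if __name__ == "__main__":
    # With horizon H = 10 and eps = 0.01, alpha must be at least 48 * 0.01 / 100.
    print(min_alpha(10, 0.01))
    print(in_lower_bound_regime(S=3, A=2, H=10, eps=0.01, alpha=0.01))
```

For a fixed $\varepsilon$, the admissible range of $\alpha$ widens as the horizon $H$ grows, because the lower threshold $48\varepsilon/H^2$ shrinks quadratically in $H$.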

Figures (4)

  • Figure 1: Comparison between different RL settings
  • Figure 2: Optimality gap of HySRL (ours) and BPI-UCBVI as the sample size varies, and percentage optimality gap of HySRL (ours) and BPI-UCBVI as the true $\beta$ varies.
  • Figure 3: Hard MDPs
  • Figure 4: The MDP on the left is the real MDP, and the MDP on the right is the empirical MDP. For any reward function $r(\cdot,\cdot)$ and any policy $\pi$, the simulation error $|Q_1-\hat{Q}_1|$ is zero, while the estimation error can be significant with $\delta$.

Theorems & Definitions (39)

  • Theorem 1: Minimax lower bound for HTRL
  • Definition 1: $\beta$-separable shift
  • Remark 1: Separable shift makes HTRL feasible
  • Definition 2: Shifted region
  • Lemma 1: Sample-efficient shift identification
  • Theorem 2: Problem-dependent sample complexity
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 29 more