A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning

JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain

TL;DR

This research harmonizes the advantage of the offline policy (exploitation) with that of the online policy (exploration), without modifying the offline policy.

Abstract

Offline-to-online reinforcement learning (RL) leverages both pre-trained offline policies and online policies trained for downstream tasks, aiming to improve data efficiency and speed up performance gains. An existing approach, Policy Expansion (PEX), explores and learns with a policy set that contains both policies, leaving the offline policy unmodified. However, PEX does not ensure sufficient learning of the online policy, because it focuses excessively on exploration with both policies. The pre-trained offline policy can help the online policy exploit a downstream task based on its prior experience, so it should be executed effectively and in a way tailored to the requirements of that task. The online policy, by contrast, has an immature behavioral strategy, which gives it potential for exploration during the training phase. Our research therefore focuses on harmonizing the advantage of the offline policy, termed exploitation, with that of the online policy, termed exploration, without modifying the offline policy. To this end, we propose an offline-to-online RL method that employs a non-monolithic exploration approach. Our method demonstrates superior performance compared to PEX.
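
To make the idea of non-monolithic control concrete, the following is a minimal sketch of an agent that alternates between a frozen offline policy (exploitation mode) and a separate online policy (exploration mode), while counting how often each is executed. It is an illustration under stated assumptions, not the authors' algorithm: the class name `ModeSwitchingAgent`, the parameters `exploit_steps` and `explore_steps`, and the fixed-length switching rule are all hypothetical placeholders, since the abstract does not specify how the mode switch is triggered.

```python
from collections import Counter
from typing import Callable, Sequence

# A policy maps a state to a discrete action (simplifying assumption).
Policy = Callable[[Sequence[float]], int]


class ModeSwitchingAgent:
    """Hypothetical controller that alternates between two separate policies.

    The offline policy is only executed, never updated. The switching rule
    (fixed-length blocks of steps) is a placeholder assumption, not the
    trigger used in the paper.
    """

    def __init__(self, offline_policy: Policy, online_policy: Policy,
                 exploit_steps: int = 8, explore_steps: int = 4) -> None:
        if exploit_steps <= 0 or explore_steps <= 0:
            raise ValueError("block lengths must be positive")
        self.offline_policy = offline_policy
        self.online_policy = online_policy
        self.exploit_steps = exploit_steps
        self.explore_steps = explore_steps
        self.mode = "offline"            # start in exploitation mode
        self.steps_left = exploit_steps  # steps remaining in the current block
        self.counts: Counter = Counter() # executions per policy

    def act(self, state: Sequence[float]) -> int:
        # Switch modes once the current block is exhausted.
        if self.steps_left == 0:
            if self.mode == "offline":
                self.mode, self.steps_left = "online", self.explore_steps
            else:
                self.mode, self.steps_left = "offline", self.exploit_steps
        self.steps_left -= 1
        self.counts[self.mode] += 1
        policy = self.offline_policy if self.mode == "offline" else self.online_policy
        return policy(state)


# Usage with stand-in policies: the offline policy always returns action 0,
# the online policy always returns action 1.
agent = ModeSwitchingAgent(lambda s: 0, lambda s: 1,
                           exploit_steps=3, explore_steps=2)
actions = [agent.act([0.0]) for _ in range(10)]
print(actions)             # which policy acted at each step
print(dict(agent.counts))  # execution count per policy
```

The design point is that exactly one policy is responsible for each step, so exploitation and exploration are handled by separate policies rather than blended into a single one. In a real training loop, only the online policy would be updated from the collected transitions, and the per-policy counters correspond to the kind of execution counts reported in Figure 3.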

Paper Structure

This paper contains 21 sections, 8 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of offline-to-online RL training schemes employed in our model with an unmodified offline policy
  • Figure 2: Normalized return curves of our model, PEX, Offline, and Buffer on benchmark tasks from D4RL. All methods use IQL as the backbone.
  • Figure 3: Execution counts of the offline and online policies for our model and PEX on benchmark tasks from D4RL. The counts are labeled $PEX\_Offline$ and $PEX\_Online$ for PEX, and $OurModel\_Offline$ and $OurModel\_Online$ for our model.