Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, Aviral Kumar

TL;DR

This work investigates no-retention online fine-tuning for RL initializations pre-trained via offline RL. It identifies Q-value recalibration and distribution-shift-induced forgetting as the main barriers to discarding offline data during fine-tuning and introduces Warm-start Reinforcement Learning (WSRL), which seeds the online replay buffer with a small number of rollouts from the pre-trained policy to stabilize learning. Through extensive experiments across simulated tasks and a real-world Franka robot, WSRL achieves faster, higher-performing fine-tuning without retaining offline data, and the warmup phase is shown to be essential for preventing the initial deterioration of the pre-trained initialization. The results suggest that offline data retention may be unnecessary for efficient RL fine-tuning, with practical implications for scaling RL to large and diverse offline datasets.

Abstract

The modern paradigm in machine learning involves pre-training on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a diverse historical dataset, followed by rapid online RL fine-tuning using interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. However, this is undesirable because training on diverse offline data is slow and expensive for large datasets and, in principle, it also limits the achievable performance improvement because of the constraints or pessimism imposed on offline data. In this paper, we show that retaining offline data is unnecessary as long as we use a properly designed online RL approach for fine-tuning offline RL initializations. To build this approach, we start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden divergence in the value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. This divergence typically results in unlearning and forgetting the benefits of offline pre-training. Our approach, Warm-start RL (WSRL), mitigates the catastrophic forgetting of pre-trained initializations using a very simple idea. WSRL employs a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy to do fast online RL. The data collected during warmup helps "recalibrate" the offline Q-function to the online distribution, allowing us to completely discard offline data without destabilizing the online RL fine-tuning. We show that WSRL can fine-tune without retaining any offline data, and that it learns faster and attains higher performance than existing algorithms, irrespective of whether they retain offline data.
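The warmup idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `ToyEnv`, `pretrained_policy`, `warmup`, and the step counts are hypothetical stand-ins chosen only to make the example runnable. It shows the one structural point the abstract makes, namely that the online buffer starts empty, is seeded with a few rollouts from the pre-trained policy, and never receives offline transitions.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical 1-D chain environment, used only to make the sketch runnable."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is +1 or -1; the episode ends when the right end is reached.
        self.state = max(0, min(self.length, self.state + action))
        done = self.state == self.length
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def pretrained_policy(state):
    """Stand-in (assumption) for the policy obtained from offline RL pre-training."""
    return 1


def warmup(env, policy, buffer, num_steps):
    """Seed the online buffer with rollouts collected by the pre-trained policy."""
    state = env.reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return buffer


# The online buffer starts empty: no offline transitions are copied into it.
online_buffer = deque(maxlen=10_000)
warmup(ToyEnv(), pretrained_policy, online_buffer, num_steps=20)

# After warmup, online RL updates would sample only from `online_buffer`.
rng = random.Random(0)
batch = rng.sample(list(online_buffer), k=8)
```

In a real fine-tuning run, the sampled `batch` would feed the Q-function and policy updates of whichever online RL algorithm is used; that part is omitted here because the abstract does not specify it.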

Paper Structure

This paper contains 31 sections, 30 figures, 1 table, 1 algorithm.

Figures (30)

  • Figure 1: No data retention fine-tuning focuses on RL fine-tuning without using the offline dataset during online updates, mirroring the common paradigm in machine learning at scale today. The offline dataset is only used to pre-train a policy and Q-function via offline RL to initialize fine-tuning, after which the dataset is discarded and the agent only fine-tunes with online experience. Current methods struggle in this "no-retention" setting and forget knowledge learned from pre-training. Our goal is to develop a fine-tuning method that quickly adapts online even if we do not retain offline data.
  • Figure 2: In no-retention fine-tuning, IQL, CQL, and CalQL all fail to fine-tune on kitchen-partial. In contrast, when continually training on offline data during fine-tuning, these algorithms work as intended. Vertical dotted line indicates the separation between pre-training and fine-tuning.
  • Figure 3: When offline data is removed (to different extents) during fine-tuning, performance drops (subfigure a) because the Q-function's fit on the offline data distribution diverges (subfigures b and c), even though the Q-function can fit the online distribution (subfigure d). This plot shows fine-tuning CalQL on kitchen-partial with 0/5/10/25% offline data in each update batch. We have similar findings with IQL and CQL.
  • Figure 4: A downward spiral effect in CQL (left), CalQL (middle), and IQL (right) Q-functions in no-retention fine-tuning on kitchen-mixed, kitchen-complete, and kitchen-partial: When fine-tuning starts at 500k steps, the Q-function goes on a downward spiral. When it eventually recovers, the policy has already unlearned (Figure \ref{fig:challenge}).
  • Figure 5: Illustration to demonstrate why Q-values are under-estimated in no-retention fine-tuning and may lead to a "downward spiral".
  • ...and 25 more figures
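
The Figure 3 caption describes update batches containing 0/5/10/25% offline data. A minimal sketch of such a mixed batch sampler follows, assuming transitions are stored in plain Python lists; the function name `sample_mixed_batch` and its parameters are hypothetical and not taken from the paper's code.

```python
import random


def sample_mixed_batch(offline_data, online_data, batch_size, offline_fraction, rng):
    """Draw a batch in which a fixed fraction of transitions comes from offline data."""
    num_offline = round(batch_size * offline_fraction)
    num_online = batch_size - num_offline
    # Sample with replacement from each source, then concatenate.
    batch = rng.choices(offline_data, k=num_offline) if num_offline else []
    batch += rng.choices(online_data, k=num_online)
    return batch


# Tagged placeholder transitions, so the source of each sample is visible.
offline = [("offline", i) for i in range(50)]
online = [("online", i) for i in range(50)]
rng = random.Random(0)

# offline_fraction=0.0 corresponds to the no-retention setting.
no_retention_batch = sample_mixed_batch(offline, online, 100, 0.0, rng)
partial_retention_batch = sample_mixed_batch(offline, online, 100, 0.25, rng)
```

Setting `offline_fraction` to 0.0 yields a batch drawn purely from online experience, which is the no-retention setting the paper studies.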