SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance

Liyu Zhang, Haochi Wu, Xu Wan, Quan Kong, Ruilong Deng, Mingyang Sun

TL;DR

SAMG tackles inefficiency in offline-to-online RL by freezing a pre-trained offline critic and guiding online refinement through a state-action-conditional coefficient that weighs offline guidance. The method integrates a frozen Q^{off} with the online Q via a data-driven coefficient p(s,a) learned from a conditional VAE, enabling 100% online data usage and avoiding reliance on offline data during online learning. The authors provide contraction-based theoretical analysis showing convergence to the online optimal policy and faster convergence on in-distribution samples, plus empirical wins on D4RL benchmarks. Limitations arise when offline coverage is narrow, suggesting adaptive data-sharing or OOD strategies as future work.

Abstract

Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically need to retain the cumbersome offline dataset to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide a compact offline estimate for each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is combined with the online target critic, weighted by a state-action-adaptive coefficient. This coefficient aims to capture the offline degree of samples at the state-action level, and is updated adaptively during training. In practice, SAMG can be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and lower estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
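To make the idea of combining a frozen offline critic with the online target critic more concrete, here is a minimal sketch of one plausible reading of it: a convex combination of the two next-state value estimates, weighted per sample by a coefficient in [0, 1]. The function name `blended_td_target`, its arguments, and the exact form of the blend are assumptions for illustration only; the abstract does not give the paper's actual update rule, and the coefficient is simply passed in here rather than computed from a conditional VAE.

```python
from typing import List, Sequence


def blended_td_target(
    rewards: Sequence[float],
    dones: Sequence[float],
    q_online_next: Sequence[float],
    q_offline_next: Sequence[float],
    p: Sequence[float],
    gamma: float = 0.99,
) -> List[float]:
    """Hypothetical TD target mixing online and frozen offline critics.

    Assumed form (not taken from the paper):
        y = r + gamma * (1 - done) * ((1 - p) * Q_online + p * Q_offline)
    where p is a per-sample coefficient in [0, 1] that is larger for
    samples that look more like the offline data.
    """
    targets = []
    for r, d, q_on, q_off, w in zip(rewards, dones, q_online_next, q_offline_next, p):
        if not 0.0 <= w <= 1.0:
            raise ValueError("coefficient p must lie in [0, 1]")
        # Convex combination of the two critics' next-state estimates.
        bootstrap = (1.0 - w) * q_on + w * q_off
        # Terminal transitions (done = 1) do not bootstrap.
        targets.append(r + gamma * (1.0 - d) * bootstrap)
    return targets


# Usage: two transitions, the second one terminal.
targets = blended_td_target(
    rewards=[1.0, 1.0],
    dones=[0.0, 1.0],
    q_online_next=[2.0, 2.0],
    q_offline_next=[4.0, 4.0],
    p=[0.5, 0.5],
    gamma=0.5,
)
print(targets)  # [2.5, 1.0]
```

With p = 0 the target reduces to the ordinary online TD target, so under this assumed form a sample judged fully out-of-distribution receives no offline guidance.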

Paper Structure

This paper contains 38 sections, 4 theorems, 34 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.1

For a given policy $\pi$, by the TD updating paradigm, $Q_k(s, a)$ of SAMG converges almost surely to $Q^{\pi}(s, a)$ as $k\to \infty$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$ if $\sum_k \alpha_k(s, a)=\infty$ and $\sum_k \alpha_k^2(s, a) < \infty$ for all $s \in \mathcal{S}$ and $a \in \

Figures (7)

  • Figure 1: Compared with the vanilla O2O framework (a), the proposed SAMG (b) uses the frozen offline RL model with an adaptive distribution-aware model to achieve online fine-tuning with 100% online sample rates.
  • Figure 2: Architecture of SAMG. This figure illustrates the structure of SAMG, highlighting the transition from offline pre-training to online fine-tuning. It outlines key components, including offline critic, VAE model, offline-guidance technique, and adaptive coefficient.
  • Figure 3: Ablation analysis of the state-action-conditional coefficient.
  • Figure 4: Sensitivity test for the coefficient threshold $p^{vae}_{m}$.
  • Figure 5: Average improvement of the normalized score over the offline data coverage.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Theorem 4.1: Convergence property of SAMG
  • Theorem 4.2: Convergence speed of SAMG
  • Theorem 1.1: Contraction mapping theorem
  • Theorem 1.2: Dvoretzky's Theorem
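
For reference, the contraction mapping theorem listed above is usually stated in the following standard form. This is the textbook (Banach fixed-point) statement, not a quotation from the paper, whose wording and notation may differ; the symbols $X$, $d$, $T$, and $x^*$ are generic.

```latex
\textbf{Contraction mapping theorem.}
Let $(X, d)$ be a complete metric space and let $T : X \to X$ satisfy
\[
  d(Tx, Ty) \le \gamma \, d(x, y) \quad \text{for all } x, y \in X,
\]
for some $\gamma \in [0, 1)$. Then $T$ has a unique fixed point $x^* \in X$,
and for any $x_0 \in X$ the iterates $x_{k+1} = T x_k$ satisfy
\[
  d(x_k, x^*) \le \gamma^{k} \, d(x_0, x^*) \to 0 \quad \text{as } k \to \infty .
\]
```

In RL, the policy-evaluation Bellman operator is a $\gamma$-contraction in the sup norm, which is the usual route by which a result of this kind yields convergence to $Q^{\pi}$; how SAMG's modified update fits this argument is detailed in the paper itself.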