Bayesian Design Principles for Offline-to-Online Reinforcement Learning

Hao Hu; Yiqin Yang; Jianing Ye; Chengjie Wu; Ziqing Mai; Yujing Hu; Tangjie Lv; Changjie Fan; Qianchuan Zhao; Chongjie Zhang

TL;DR

The paper addresses offline-to-online reinforcement learning, where fine-tuning faces a dilemma between pessimism and optimism: a pessimistic agent may fail to improve on the offline policy, while an abruptly optimistic agent may suffer a sudden performance drop. It proposes a probability-matching Bayesian design that samples from the posterior over policies to balance information gain against reuse of offline data, supported by an information-theoretic analysis and regret bounds for linear MDPs. The authors introduce BOORL, a two-phase algorithm that applies bootstrapped posterior estimation in the offline phase and posterior sampling during online interaction, achieving robust improvements while remaining compatible with existing offline RL methods. Empirical evaluations on Bernoulli bandits and D4RL benchmarks show better performance and stability during the offline-to-online transition. Overall, the Bayesian, information-theoretic perspective offers a principled framework for offline-to-online RL.

Abstract

Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.
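To make the idea of "acting in a way that matches its belief in optimal policies" concrete, here is a minimal sketch of probability matching on a two-armed Bernoulli bandit. It is an illustration under stated assumptions, not the paper's algorithm: the Beta(1, 1) prior, the offline counts, and the helper names `posterior_from_offline` and `probability_matching_action` are all hypothetical.

```python
import numpy as np

def posterior_from_offline(successes, failures, prior=(1.0, 1.0)):
    """Beta posterior parameters per arm, built from offline success/failure counts.

    The Beta(1, 1) prior is an assumption made for this sketch.
    """
    a = prior[0] + np.asarray(successes, dtype=float)
    b = prior[1] + np.asarray(failures, dtype=float)
    return a, b

def probability_matching_action(a, b, rng):
    """Draw one plausible mean per arm from the posterior and act greedily on the draw."""
    theta = rng.beta(a, b)
    return int(np.argmax(theta))

rng = np.random.default_rng(0)

# Hypothetical offline dataset: arm 0 is well covered, arm 1 is barely covered.
a, b = posterior_from_offline(successes=[30, 2], failures=[20, 1])

# Repeating the draw shows that each arm is chosen with (approximately) the
# posterior probability that it is the optimal arm.
picks = np.array([probability_matching_action(a, b, rng) for _ in range(10_000)])
freq = np.bincount(picks, minlength=2) / picks.size
print(freq)
```

In this sketch the poorly covered arm is neither ignored, as a purely pessimistic rule would do, nor always preferred, as a purely optimistic rule would do: it is tried in proportion to how plausible it is that it is the better arm.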

Paper Structure (46 sections, 11 theorems, 52 equations, 9 figures, 7 tables, 5 algorithms)

Introduction
Related Works
Offline RL.
Offline-to-Online RL.
Bayesian RL and Information-Theoretic Analysis.
Preliminaries
Episodic Reinforcement Learning
Linear Function Approximation
Information Gain and Bayesian Learning
Theoretical Analysis
Information-Theoretic Analysis
Specification in Linear MDPs
Method
Offline Phase.
Online Phase.
...and 31 more sections

Key Result

Theorem 4

Then the per-episode regret of Thompson Sampling and UCB agents satisfies where $a_{k,h}\sim \pi_{k,h}$. Similarly, the per-episode regret of Thompson Sampling and LCB agents satisfies where $a_{h}^*\sim \pi_{h}^*$.

Figures (9)

Figure 1: Fine-tuning dilemma in the offline-to-online setting. If the algorithm remains pessimistic, as offline algorithms are, the agent learns slowly due to a lack of exploration (green). Conversely, when the algorithm is optimistic, the agent's performance may suffer from a sudden drop due to inefficient use of offline knowledge and radical exploration (orange). We adopt a probability-matching approach to attain a fast and robust performance improvement (blue).
Figure 2: Theoretical upper bound in Theorem 4 and experimental results on Bernoulli bandits. The Bayesian approach matches the performance of LCB at an early stage by properly using the prior knowledge in the dataset, and matches the performance of UCB in the long run by allowing efficient exploration. Therefore, a realistic Bayesian agent performs better than both optimistic UCB and pessimistic LCB agents.
Figure 3: Comparison of BOORL with several baselines within 0.2M time steps. The reference line is the performance of TD3+BC. The experimental results are averaged over five random seeds. Please refer to the appendix for more results.
Figure 4: Theoretical upper bounds in Theorem 4 and experimental results on the Bernoulli bandit.
Figure 5: Performance of different switch schemes from LCB to UCB on the Bernoulli bandit. It incurs a large regret to switch from pessimism to optimism regardless of interpolation schemes. $x$ in LCB2UCB ($x$) represents the switch parameter.
...and 4 more figures
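The Bernoulli-bandit comparison described in the Figure 2 caption can be sketched as follows. This is a rough illustration, assuming a Beta(1, 1) prior, a generic confidence bonus of the form `sqrt(log(t + 2) / (n + 1))`, and a made-up offline dataset; the function `run_bandit` and all constants are hypothetical and are not taken from the paper's experimental setup.

```python
import numpy as np

def run_bandit(rule, means, offline_counts, horizon, rng, bonus_scale=1.0):
    """Cumulative pseudo-regret of one agent warm-started with offline counts.

    rule is "lcb" (pessimistic), "ucb" (optimistic), or "ts" (Thompson sampling).
    offline_counts is a list of (successes, failures) pairs, one per arm.
    """
    means = np.asarray(means, dtype=float)
    succ = np.array([s for s, _ in offline_counts], dtype=float)
    fail = np.array([f for _, f in offline_counts], dtype=float)
    regret = np.zeros(horizon)
    total = 0.0
    for t in range(horizon):
        n = succ + fail
        mean_hat = (succ + 1.0) / (n + 2.0)  # posterior mean under a Beta(1, 1) prior
        bonus = bonus_scale * np.sqrt(np.log(t + 2.0) / (n + 1.0))
        if rule == "ucb":
            arm = int(np.argmax(mean_hat + bonus))
        elif rule == "lcb":
            arm = int(np.argmax(mean_hat - bonus))
        elif rule == "ts":
            arm = int(np.argmax(rng.beta(succ + 1.0, fail + 1.0)))
        else:
            raise ValueError(f"unknown rule: {rule}")
        reward = float(rng.random() < means[arm])  # Bernoulli reward
        succ[arm] += reward
        fail[arm] += 1.0 - reward
        total += means.max() - means[arm]  # pseudo-regret uses the true means
        regret[t] = total
    return regret

rng = np.random.default_rng(0)
means = [0.5, 0.6]             # hypothetical true arm means
offline = [(25, 25), (1, 1)]   # hypothetical dataset: arm 1 is barely covered
curves = {r: run_bandit(r, means, offline, 2000, rng) for r in ("lcb", "ucb", "ts")}
for rule, curve in curves.items():
    print(rule, round(float(curve[-1]), 1))
```

Plotting the three curves against the step index gives a picture in the spirit of the figure; the exact numbers depend on the seed, the bonus scale, and the offline counts, so this sketch should not be read as reproducing the paper's results.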

Theorems & Definitions (15)

Definition 1: Linear MDP
Definition 2: Information Ratio
Definition 3
Theorem 4
Theorem 5: Regret of Bayesian Agents in Linear MDPs, informal
Proposition 6
Proposition 7
Proposition 8
Theorem 9: Regret of Bayesian Agents in Linear MDPs, restatement
Lemma 10: Failure of UCB
...and 5 more
