Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Kun Lei; Zhengmao He; Chenhao Lu; Kaizhe Hu; Yang Gao; Huazhe Xu

TL;DR

Uni-O4 uses a single on-policy objective for both offline and online learning, yielding superior offline initialization and stable, rapid online fine-tuning, as demonstrated on real-world robot tasks.

Abstract

Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-O4, which utilizes an on-policy objective for both offline and online learning. Owing to the alignment of objectives in the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-O4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-O4 can achieve multi-step policy improvement safely. We demonstrate that, with this method, the fusion of the two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning. Project website: https://lei-kun.github.io/uni-o4/ .
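The abstract does not spell out the form of the shared on-policy objective. As a rough illustration only, the sketch below assumes a PPO-style clipped surrogate, where the "old" policy is the estimated behavior policy in the offline phase and the previous policy iterate in the online phase. The function name `clipped_surrogate`, the argument names, and the default `clip_eps` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def clipped_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # assumed clipping range, not from the paper
) -> float:
    """Mean clipped surrogate over a batch of (state, action) samples.

    logp_new:   log-probabilities of the actions under the policy being trained.
    logp_old:   log-probabilities under the reference policy (assumption:
                estimated behavior policy offline, previous iterate online).
    advantages: advantage estimates for the same samples.
    """
    if not advantages:
        raise ValueError("empty batch")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the same function serves both phases under this assumption, switching from offline to online training would only change where the samples and the reference log-probabilities come from, which is one way to read the "alignment of objectives" claimed above.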

Paper Structure (33 sections, 1 theorem, 26 equations, 29 figures, 6 tables, 2 algorithms)

Introduction
Preliminaries
Method
Ensemble behavior cloning with disagreement-based regularization
Multi-step policy ensemble optimization
Related work
Experiments
Main results
Applications on real-world robots
Ablation study
Conclusion
Appendix
Proof of Theorem \ref{thm1}
Proof of Theorem \ref{thm2}
Related work on real-world robot learning
...and 18 more sections

Key Result

Proposition 1

Given the dataset $\mathcal{D}$ and policies $\prod_n$, the distance over $\hat{\pi}_{\beta}^i(\cdot|s)$ and $f(\{\hat{\pi}_{\beta}^j(\cdot|s)\})$ can be expressed as $D_\text{KL}\left(\hat{\pi}_{\beta}^i(\cdot|s)\,\middle\|\,\frac{f \left(\{\hat{\pi}_{\beta}^j(\cdot|s)\}\right)}{Z(s)}\right)$, where $Z(s)$ is

Figures (29)

Figure 1: (a) Normalized return curves of online fine-tuning and offline-initialized scores on all MuJoCo tasks. (b) Average $V$- or $Q$-values of value functions on the Hopper and Walker2d medium tasks.
Figure 2: Uni-O4 employs supervised learning to learn the components for initializing the subsequent phase. In the offline multi-step optimization phase (middle), policies query AM-Q to determine whether to replace the behavior policies after a certain number of training steps. For instance, AM-Q allows $\pi^{2}$ to replace its behavior policy with its target policy but rejects the others. Subsequently, one policy is selected as the initialization for online fine-tuning. Here, $OOS_{\mathcal{D}}$ denotes out-of-support of the dataset.
Figure 3: Aggregated learning curves of various approaches on the MuJoCo locomotion and Adroit manipulation tasks. It shares its legend with Figure \ref{fig:online_curve} for simplicity.
Figure 4: The learning curves of various methods on Adroit and MuJoCo locomotion tasks are presented across five different seeds. The solid lines indicate the mean performance, while the shaded regions represent the corresponding standard deviation.
Figure 5: Real-world experiments: (a) The workflow of Uni-O4. (b) Testing all methods with low-speed commands. The reported results are averaged over five trials, with each trial having a maximum of 1000 time steps. (c) Testing all methods with high-speed commands; see Appendix \ref{sec:detailed_go1_inf}.
...and 24 more figures
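
The gating step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: AM-Q is treated as a black-box scoring function `ope_score`, the acceptance rule (accept when the target policy scores strictly higher than its behavior policy) and the final selection rule (highest score) are assumptions, and the names `Member`, `gated_update`, and `select_for_finetuning` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Member:
    """One ensemble member; strings stand in for policy parameters."""
    behavior: str
    target: str


def gated_update(members: List[Member], ope_score: Callable[[str], float]) -> List[bool]:
    """Replace a member's behavior policy with its target only if OPE accepts it."""
    accepted = []
    for m in members:
        # Assumed rule: accept when the target scores strictly higher.
        ok = ope_score(m.target) > ope_score(m.behavior)
        if ok:
            m.behavior = m.target
        accepted.append(ok)
    return accepted


def select_for_finetuning(members: List[Member], ope_score: Callable[[str], float]) -> int:
    """Pick the index of the member used to initialize online fine-tuning.

    Assumed rule: highest OPE score among the current behavior policies.
    """
    return max(range(len(members)), key=lambda i: ope_score(members[i].behavior))


# Toy usage mirroring the caption: only the second member is accepted.
toy_scores = {"b1": 1.0, "t1": 0.5, "b2": 1.0, "t2": 2.0, "b3": 1.0, "t3": 0.9}
ensemble = [Member("b1", "t1"), Member("b2", "t2"), Member("b3", "t3")]
decisions = gated_update(ensemble, toy_scores.__getitem__)
chosen = select_for_finetuning(ensemble, toy_scores.__getitem__)
```

In the toy run, the scores are made-up numbers chosen so that only the second target policy passes the gate, matching the caption's example in which $\pi^{2}$ is accepted and the others are rejected.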

Theorems & Definitions (2)

Proposition 1
Definition 1
