Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, Xiaokang Yang

TL;DR

CoWorld is a model-based RL approach that enables effective online-to-offline knowledge transfer by mitigating cross-domain discrepancies in state and reward spaces, and it outperforms existing RL approaches by large margins.

Abstract

Training offline RL models using visual inputs poses two significant challenges, i.e., the overfitting problem in representation learning and the overestimation bias for expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the "test bed" for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results demonstrate the effectiveness of CoWorld, outperforming existing RL approaches by large margins.

Paper Structure

This paper contains 46 sections, 12 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our approach for offline visual RL.
  • Figure 2: To address value overestimation in offline RL (a), we can directly penalize the estimated values beyond the distribution of offline data, which may hinder the agent's exploration of potential states with high rewards (b). Unlike existing methods, CoWorld trains a cross-domain critic model in an online auxiliary domain to reassess the offline policy (c), and regularizes the target values with flexible constraints (d). The feasibility of this approach lies in the domain alignment techniques during the world model learning stage.
  • Figure 3: Left: The value in each grid cell indicates the ratio of the return achieved by CoWorld to that of Offline DV2. Highlighted cells mark the top-performing source domain. Right: Returns on Drawer Close (DC*) with different source domains. The multi-source CoWorld (yellow line) automatically discovers a useful task (i.e., Door Close) as the source domain and achieves results comparable to the top-performing single-source CoWorld (red line).
  • Figure 4: Quantitative results in domain transfer scenarios of Meta-World $\rightarrow$ RoboDesk.
  • Figure 5: (a) Ablation studies on state alignment, reward alignment, and min-max value constraint. (b) The disparities between the estimated value by various models and the true value. Please see the text in Section \ref{rec:ablation} for the implementation of CoWorld w/o Max.
  • ...and 6 more figures