Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng

Abstract

Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
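
The two ingredients named in the abstract, disentanglement regularization and latent distillation, can be illustrated with a small sketch. The abstract does not give the loss functions, so everything here is an assumption: the regularizer is written in the β-VAE form (a β-weighted KL divergence to a standard normal prior), the distillation term is taken to be a KL divergence from the pretrained (teacher) latent to the world-model (student) latent, and the function names and the default β value are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between two diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed-form KL for one dimension, using log-variances for stability.
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def beta_kl_regularizer(mu, logvar, beta=4.0):
    """Assumed beta-VAE-style penalty: beta * KL(q(z|x) || N(0, I))."""
    zeros = [0.0] * len(mu)
    return beta * kl_diag_gaussians(mu, logvar, zeros, zeros)


def distillation_loss(teacher_mu, teacher_logvar, student_mu, student_logvar):
    """Assumed distillation term: KL(teacher latent || student latent).

    The teacher stands in for the pretrained video prediction model and the
    student for the world model; the KL direction is a choice made for this sketch.
    """
    return kl_diag_gaussians(teacher_mu, teacher_logvar, student_mu, student_logvar)


# A one-dimensional latent one unit from the prior mean incurs a penalty of beta * 0.5.
penalty = beta_kl_regularizer([1.0], [0.0], beta=4.0)
# A student that matches the teacher exactly incurs no distillation loss.
matched = distillation_loss([0.3, -0.2], [0.0, 0.1], [0.3, -0.2], [0.0, 0.1])
```

A larger β pushes each latent dimension harder toward the factorized prior, which is the usual lever for trading reconstruction quality against disentanglement in β-VAE-style models.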

Paper Structure

This paper contains 30 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of our proposed framework. The key idea is to leverage distracting videos for semantic knowledge transfer, enabling the downstream agent to improve sample efficiency on unseen tasks.
  • Figure 2: Architecture of Disentangled World Models. (a) The action-free video prediction model with disentanglement constraints is pretrained offline on distracting videos to obtain the well-disentangled latent variable $\textbf{z}_{\text{disen}}$, which extracts semantic knowledge from the visual observations. The disentanglement capability of $\textbf{z}_{\text{disen}}$ is then transferred to the world model through latent distillation. (b) The action-conditioned world model is finetuned online with disentanglement regularization, which encourages a factorized representation. Moreover, the incorporation of actions and rewards enriches the diversity of the data, which in turn strengthens the disentangled representation learning.
  • Figure 3: Example image observations of our modified DMC and MuJoCo Pusher with color distractors.
  • Figure 4: Comparison of DisWM against visual RL baselines, including DreamerV2 [hafner2021mastering], APV [seo2022reinforcement], DV2 Finetune, TED [dunion2023temporal], and CURL [laskin2020curl].
  • Figure 5: Visualization of traversals of $\beta$-VAE during the pretraining phase.
  • ...and 5 more figures
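
The latent traversals mentioned in the Figure 5 caption are commonly produced by sweeping one latent dimension over a range of values while holding the others fixed, then decoding each resulting latent. The sketch below covers only the sweep; the helper names and the sweep range are hypothetical, and the decoder of the pretrained model is not shown because it is not described here.

```python
def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def latent_traversal(z, dim, values):
    """Return one copy of z per value, with z[dim] replaced by that value."""
    if not 0 <= dim < len(z):
        raise IndexError("dim is out of range for the latent vector")
    rows = []
    for v in values:
        z_copy = list(z)  # copy so the base latent stays unchanged
        z_copy[dim] = v
        rows.append(z_copy)
    return rows


# Sweep dimension 1 of a 3-dimensional latent across five values.
base = [0.1, 0.2, 0.3]
rows = latent_traversal(base, dim=1, values=linspace(-2.0, 2.0, 5))
```

If a representation is well disentangled, decoding the rows of one traversal should change a single semantic factor (for example, color) while leaving the rest of the image unchanged.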