Representation Matters: Offline Pretraining for Sequential Decision Making

Mengjiao Yang, Ofir Nachum

TL;DR

The paper investigates whether offline data can be leveraged to learn state representations that accelerate downstream sequential decision-making. It systematically evaluates a broad set of unsupervised objectives, with a focus on contrastive self-prediction, using offline Gym-MuJoCo datasets across imitation, offline RL, and online RL tasks. The key finding is that certain contrastive objectives (notably ACL, TCL, and VPN) substantially improve performance, with detailed ablations showing the effects of horizon length, input components, and training strategies. This work provides practical guidance on representation-learning choices for offline data and highlights nuanced interactions between representation methods and downstream tasks. It also points to future work in multi-task, transfer, and real-world robotic settings.

Abstract

The recent success of supervised learning methods on ever larger offline datasets has spurred interest in the reinforcement learning (RL) field to investigate whether the same paradigms can be translated to RL algorithms. This research area, known as offline RL, has largely focused on offline policy optimization, aiming to find a return-maximizing policy exclusively from offline data. In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making. We aim to answer the question, what unsupervised objectives applied to offline datasets are able to learn state representations which elevate performance on downstream tasks, whether those downstream tasks be online RL, imitation learning from expert demonstrations, or even offline policy optimization based on the same offline dataset? Through a variety of experiments utilizing standard offline RL datasets, we find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own. Extensive ablations further provide insights into what components of these unsupervised objectives -- e.g., reward prediction, continuous or discrete representations, pretraining or finetuning -- are most important and in which settings.

Paper Structure

This paper contains 38 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: A summary of the advantages of representation learning via contrastive self-prediction, across a variety of settings: imitation learning, offline RL, and online RL. Each subplot shows the aggregated mean reward and standard error during training, with aggregation over offline datasets of different behavior (e.g., expert, medium, etc.), with five seeds per dataset (see Section \ref{sec:setup}). Representation learning yields significant performance gains in all domains and tasks.
  • Figure 2: Performance of downstream imitation learning, offline RL, and online RL tasks under a variety of representation learning objectives. $x$-axis shows aggregated average rewards (over five seeds) across the domains and datasets described in Section \ref{sec:setup}. Methods that failed to converge are eliminated from the results (see Appendix \ref{app:exp}). ACL is set to the default configuration that favors imitation learning (see Section \ref{exp:depth}). When applicable, we also label variants with $k+1\in\{2,8\}$. Methods above the dotted line are variants of contrastive self-prediction. ACL performs well on imitation learning. VPN and (momentum) TCL perform well on offline and online RL.
  • Figure 3: A pictorial representation of our depth study based on contrastive self-prediction. We use the transformer-based architecture of attentive contrastive learning (ACL) as a skeleton for ablations with respect to various representation learning details. Solid arrows correspond to the configuration of ACL. Dotted arrows and blue text are factors considered in the ablation study. Gray blocks are masked state/action/reward entries. After the pretraining phase, the representation network $\phi$ is reused for downstream tasks, unless 'context embedding' is true, in which case the transformer is used.
  • Figure 4: Ablation results on imitation learning, offline RL, and online RL. $x$-axis shows average rewards and standard error aggregated over either different Gym-MuJoCo datasets (imitation and offline RL) or domains (online RL). Blue dotted lines show average rewards without pretraining. (T) and (F) mean setting each factor to true or false (opposite from the default configuration). Reconstructing, predicting, or inputting action or reward (rows 2-7) impairs imitation performance but is important for offline and online RL. A bidirectional transformer hurts imitation learning when the downstream sample size is small. Finetuning and auxiliary loss can help online RL. Additional results are presented in Appendix \ref{app:exp_results}.
  • Figure 5: Average reward across domains and datasets with different representation dimensions. $256$ and $512$ work the best (this ablation is conducted with "reconstruct action" and "reconstruct reward" set to true).
  • ...and 4 more figures
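
To make the idea of a contrastive objective over state representations more concrete, the following is a minimal sketch of an InfoNCE-style loss. It is not the paper's implementation: the function name `info_nce_loss`, the dot-product similarity, and the use of other batch entries as negatives are all assumptions made for illustration, and the embeddings would in practice come from a learned representation network such as $\phi$.

```python
import math


def info_nce_loss(anchors, positives):
    """Mean InfoNCE-style loss over a batch of embedding pairs.

    Hypothetical helper for illustration only. anchors[i] and positives[i]
    are embeddings (lists of floats) of a state and a temporally nearby
    state; every positives[j] with j != i is treated as a negative.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Dot-product similarity between anchor i and every candidate.
        logits = [
            sum(a * b for a, b in zip(anchors[i], positives[j]))
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        total += log_norm - logits[i]
    return total / n


# Toy embeddings: aligned pairs should score a lower loss than swapped pairs.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch, uninformative embeddings (all logits equal) give a loss of the logarithm of the batch size, and the loss drops as each anchor becomes more similar to its own positive than to the other candidates.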