Value Explicit Pretraining for Learning Transferable Representations

Kiran Lekkala, Henghui Bao, Sumedh Sontakke, Laurent Itti

TL;DR

Experiments show that the encoder pretrained with the proposed Value Explicit Pretraining method generalizes to unseen tasks better than current state-of-the-art (SoTA) pretraining methods.

Abstract

We propose Value Explicit Pretraining (VEP), a method that learns generalizable representations for transfer reinforcement learning. VEP enables learning of new tasks whose objectives are similar to those of previously learned tasks by training an encoder to produce objective-conditioned representations that are invariant to changes in appearance and environment dynamics. To pretrain the encoder from a sequence of observations, we use a self-supervised contrastive loss that yields temporally smooth representations. VEP learns to relate states across different tasks based on the Bellman return estimate, which reflects task progress. Experiments using a realistic navigation simulator and the Atari benchmark show that the encoder pretrained with our method generalizes to unseen tasks better than current SoTA pretraining methods. VEP achieves up to a $2\times$ improvement in rewards on Atari and visual navigation, and up to a $3\times$ improvement in sample efficiency. For videos of policy performance, visit https://sites.google.com/view/value-explicit-pretraining/
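To make the "Bellman return estimate" concrete, the sketch below computes a discounted return for every frame of a single expert trajectory. It assumes the estimate is a discounted sum of future rewards; the function name `discounted_returns`, the discount factor, and the reward values are illustrative and not taken from the paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one trajectory.

    Hypothetical helper: the name, signature, and default gamma are assumptions.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative 4-frame video with a single reward on the final frame.
example = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

With a reward only on the last frame, the returns grow as the frames get closer to the goal (0.125, 0.25, 0.5, 1.0 here), which is the sense in which the return tracks task progress.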

Paper Structure (13 sections, 5 equations, 8 figures, 1 algorithm)


Figures (8)

  • Figure 1: High-level overview of our problem statement. The encoder $f_\phi$ is pretrained on expert videos from a set of training tasks and is then reused for an unseen task. We evaluate pretrained encoders produced by our method and by the baselines on the Atari and Navigation benchmarks.
  • Figure 2: Description of our method (VEP). We compute value estimates (Bellman returns), denoted by $G$, for each frame. We then use a contrastive pretraining method that learns task-agnostic representations based on $G$. The figure depicts a training scenario in which the sampling batch size $b_T$ is 2 and the training batch size $b_G$ is 1, so the anchor, positive, and negative in each batch are sampled from two sequences.
  • Figure 3: Pretraining results on Atari. Performance of different pretraining methods on the respective games as mentioned above. The encoder is pretrained only on the first 2 games (Demon Attack and Space Invaders) and is evaluated on the other out-of-domain games.
  • Figure 4: Pretraining results on Navigation. Performance of different pretraining methods on the respective cities as mentioned above. Similar to the Atari experiments, for all the baselines, expert videos from the first two tasks (Wall Street and Union Square) were used for pretraining. VEP representations improve PPO policy performance by up to $2\times$.
  • Figure 5: Comparison of our method with an end-to-end trained method on the Navigation task. In each of the training curves, the end-to-end baseline has the entire model trained on the respective task, whereas our method (VEP) is pretrained only on expert videos from Wall Street and Union Square. Compared to any pretrained method, the end-to-end baseline takes significantly longer to train ($2.1\times$ for Navigation and $3.3\times$ for Atari). Since both methods were trained for the same number of iterations (20M), our method finished earlier, and the dotted line is shown only for comparison.
  • ...and 3 more figures
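
The Figure 2 caption describes drawing an anchor, a positive, and a negative from two sequences based on $G$. A minimal sketch of one way to do this is shown below. It assumes the positive is the frame in the second sequence whose return is closest to the anchor's, and the negative is the frame whose return is farthest. This selection rule and the name `sample_triplet` are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def sample_triplet(returns_a, returns_b, rng):
    """Pick (anchor, positive, negative) frame indices across two sequences.

    Hypothetical helper: the anchor indexes sequence A, while the positive
    and negative index sequence B. The nearest/farthest rule is an assumption.
    """
    returns_a = np.asarray(returns_a, dtype=np.float64)
    returns_b = np.asarray(returns_b, dtype=np.float64)
    # Draw the anchor frame uniformly at random from sequence A.
    anchor = int(rng.integers(len(returns_a)))
    # Absolute difference in return between the anchor and each frame of B.
    gap = np.abs(returns_b - returns_a[anchor])
    positive = int(np.argmin(gap))  # most similar task progress
    negative = int(np.argmax(gap))  # least similar task progress
    return anchor, positive, negative

# Illustrative per-frame returns for two expert videos from different tasks.
seq_a = [0.125, 0.25, 0.5, 1.0]
seq_b = [0.5, 1.0]
anchor, positive, negative = sample_triplet(seq_a, seq_b, np.random.default_rng(0))
```

The three indices would then select frames to pass through the encoder $f_\phi$ and a contrastive loss. Pairing frames by return rather than by appearance is what lets states from different tasks be related.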