Masked Autoencoding for Scalable and Generalizable Decision Making

Fangchen Liu, Hao Liu, Aditya Grover, Pieter Abbeel

TL;DR

MaskDP introduces a scalable, self-supervised pretraining framework for reinforcement learning that masks state-action trajectories and reconstructs them with a bidirectional Transformer. By employing random masking with multiple ratios, MaskDP learns forward and inverse dynamics from diverse unlabeled data, enabling zero-shot transfer to goal-reaching and skill prompting, and competitive offline RL performance. The method demonstrates model-size scaling and robustness across data quality, while supporting open-loop generation and closed-loop replanning at deployment. Overall, MaskDP provides a general masked-prediction paradigm for sequential decision making that parallels success in NLP and vision.

Abstract

We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and behavioral cloning (BC). In our MaskDP approach, we apply a masked autoencoder (MAE) to state-action trajectories, wherein we randomly mask state and action tokens and reconstruct the missing data. By doing so, the model is required to infer masked-out states and actions and extract information about dynamics. We find that masking different proportions of the input sequence significantly helps with learning a better model that generalizes well to multiple downstream tasks. In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior w.r.t. model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining.
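
The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the candidate mask ratios, and the use of zeros in place of a learned mask token are all assumptions made for illustration. It draws one mask ratio per trajectory from a set of ratios, then masks that fraction of the interleaved state and action tokens uniformly at random.

```python
import numpy as np

# Hypothetical set of mask ratios; the values used in the paper are not given here.
ASSUMED_RATIOS = (0.15, 0.35, 0.55, 0.75, 0.95)

def sample_mask(num_steps, ratios=ASSUMED_RATIOS, rng=None):
    """Return boolean masks (True = masked) for state and action tokens."""
    rng = np.random.default_rng() if rng is None else rng
    ratio = rng.choice(ratios)                  # one mask ratio per trajectory
    num_tokens = 2 * num_steps                  # one state and one action token per step
    num_masked = int(round(ratio * num_tokens))
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[masked_idx] = True
    # Tokens are interleaved: even positions are states, odd positions are actions.
    return mask[0::2], mask[1::2], ratio

def apply_mask(tokens, mask):
    """Zero out masked rows (a stand-in for a learned mask token)."""
    out = tokens.copy()
    out[mask] = 0.0
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(16, 4))           # toy trajectory: 16 steps, 4-dim states
    actions = rng.normal(size=(16, 2))          # 2-dim actions
    s_mask, a_mask, ratio = sample_mask(16, rng=rng)
    masked_states = apply_mask(states, s_mask)
    masked_actions = apply_mask(actions, a_mask)
    print(f"ratio={ratio}, masked states={s_mask.sum()}, masked actions={a_mask.sum()}")
```

A model trained on such inputs would be asked to reconstruct the original `states` and `actions` at the masked positions. Because either a state or an action can be hidden at any step, the same objective covers both forward-dynamics-style and inverse-dynamics-style prediction.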

Paper Structure (48 sections, 18 figures, 9 tables)

Figures (18)

  • Figure 1: Illustration of MaskDP. During the pretraining stage, we perform the masked token prediction task. After pretraining, the model can be deployed to various downstream tasks using different mask patterns.
  • Figure 2: Single-task pretraining followed by the single-goal reaching downstream task. MaskDP with closed-loop execution achieves the best performance on all tasks and gets the most significant improvements in the Quadruped domain, which is higher-dimensional.
  • Figure 3: Single-task pretraining followed by the multiple-goal reaching downstream task. MaskDP achieves significant improvement on all tasks, with better flexibility in sequential goal reaching.
  • Figure 4: Multi-task pretraining followed by the single-goal reaching downstream task, where MaskDP with closed-loop execution works best, especially in the Quadruped domain.
  • Figure 5: Multi-task pretraining followed by the multiple-goal reaching downstream task.
  • ...and 13 more figures
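
The Figure 1 caption notes that a pretrained model can be deployed to downstream tasks by changing the mask pattern. The sketch below shows what such patterns might look like for goal reaching and skill prompting. It is an illustrative assumption, not the paper's code: the function names and the exact choice of which tokens stay visible are hypothetical.

```python
import numpy as np

def goal_reaching_mask(num_steps, goal_steps):
    """Mask everything except the current state and the goal states.

    True = masked (to be predicted), False = visible to the model.
    """
    state_mask = np.ones(num_steps, dtype=bool)
    action_mask = np.ones(num_steps, dtype=bool)   # all actions must be inferred
    state_mask[0] = False                          # current state is observed
    state_mask[list(goal_steps)] = False           # goal states are given as targets
    return state_mask, action_mask

def skill_prompt_mask(num_steps, prompt_len):
    """Reveal the first `prompt_len` transitions as a prompt; mask the rest."""
    state_mask = np.ones(num_steps, dtype=bool)
    action_mask = np.ones(num_steps, dtype=bool)
    state_mask[:prompt_len] = False
    action_mask[:prompt_len] = False
    return state_mask, action_mask

if __name__ == "__main__":
    # Single goal at the final step of a 10-step window.
    print(goal_reaching_mask(10, goal_steps=[9]))
    # Multiple goals at intermediate and final steps.
    print(goal_reaching_mask(10, goal_steps=[4, 9]))
    # Skill prompting from three example transitions.
    print(skill_prompt_mask(10, prompt_len=3))
```

In closed-loop execution, one would presumably apply only the first predicted action, observe the next state, and rebuild the mask from that new state. In open-loop generation, the whole predicted action sequence would be executed without replanning.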