Table of Contents
Fetching ...

Learning Versatile Skills with Curriculum Masking

Yao Tang, Zhihui Xie, Zichuan Lin, Deheng Ye, Shuai Li

TL;DR

CurrMask is proposed, a curriculum masking pretraining paradigm for sequential decision making motivated by how humans learn by organizing knowledge in a curriculum, which exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks.

Abstract

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.

Learning Versatile Skills with Curriculum Masking

TL;DR

CurrMask is proposed, a curriculum masking pretraining paradigm for sequential decision making motivated by how humans learn by organizing knowledge in a curriculum, which exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks.

Abstract

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.

Paper Structure

This paper contains 37 sections, 6 equations, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: Illustration of CurrMask. Based on the framework of masked prediction modeling in (a), the block-wise masking scheme in (b), we design a masking pool $\mathcal{M}$ in (c) consisting of masking schemes at different levels of complexity, which CurrMask evaluates the learning progress of the model $\mathcal{\theta}$ and samples masking schemes from during pretraining in (d).
  • Figure 2: Offline RL results. We report the finetuning performance of models pretrained with different masking schemes. Results are averaged over 10 random seeds.
  • Figure 3: Both block-wise masking and curriculum masking contribute to CurrMask's performance. Left: the performance of zero-shot skill prompting as a function of fixed block size. Right: the probabilities of choosing different block sizes and mask ratios during pretraining with CurrMask.
  • Figure 4: Analysis of long-term prediction capability. Left: We visualize the attention map (L2-normalized over different heads) of the first decoder layer, when the model is conducting skill prompting. The horizontal axis represents the keys, and the vertical axis represents the queries. Right: the performance of zero-shot skill prompting as a function of rollout length (tasks: walker run and jaco reach top left).
  • Figure 5: Skill prompting performance on Walker run in the pretraining phase.
  • ...and 3 more figures