Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze

TL;DR

An analysis of the loss landscape shows that masking and finetuning produce models that reside in minima connected by a line segment with nearly constant test accuracy, confirming that masking can be used as an efficient alternative to finetuning.
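
The line-segment claim can be illustrated with a small sketch. This is not the paper's evaluation code: the linear classifier, the synthetic data, and the names `interpolate`, `accuracy`, `theta_a`, and `theta_b` are all hypothetical stand-ins for the finetuned and masked models. It only shows what "accuracy along a line segment between two solutions" means in practice.

```python
import numpy as np

def interpolate(theta_a, theta_b, alpha):
    """Point on the line segment between two parameter vectors (alpha in [0, 1])."""
    return (1.0 - alpha) * theta_a + alpha * theta_b

def accuracy(theta, X, y):
    """Accuracy of a toy linear classifier; a stand-in for test accuracy."""
    preds = (X @ theta > 0).astype(int)
    return float((preds == y).mean())

rng = np.random.default_rng(0)

# Synthetic, linearly separable data (assumption: purely illustrative).
X = rng.normal(size=(200, 5))
true_theta = rng.normal(size=5)
y = (X @ true_theta > 0).astype(int)

# Two hypothetical solutions standing in for the finetuned and masked models.
theta_a = true_theta + 0.05 * rng.normal(size=5)
theta_b = 2.0 * true_theta + 0.05 * rng.normal(size=5)

# Evaluate accuracy at evenly spaced points along the segment.
alphas = np.linspace(0.0, 1.0, 11)
curve = [accuracy(interpolate(theta_a, theta_b, a), X, y) for a in alphas]

for a, acc in zip(alphas, curve):
    print(f"alpha={a:.1f}  accuracy={acc:.3f}")
```

If the printed curve stays close to the endpoint accuracies, no high-loss barrier separates the two solutions along that segment, which is the property the TL;DR describes.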

Abstract

We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred simultaneously. Through intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.
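
As a minimal sketch of what "learning binary masks for pretrained weights" can look like, the example below keeps a weight matrix frozen and trains only real-valued mask scores. The thresholding step, the straight-through gradient, the initial score value, and the names `binarize`, `forward`, `score_grad`, and `THRESHOLD` are assumptions chosen for illustration; the abstract does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" weight matrix (stand-in for one linear layer).
W = rng.normal(size=(4, 3))

# Real-valued mask scores, one per weight; the only trainable values here.
# The initial value and the threshold are illustrative assumptions.
scores = np.full(W.shape, 0.02)
THRESHOLD = 0.01

def binarize(scores, threshold=THRESHOLD):
    """Turn real-valued scores into a 0/1 mask by thresholding."""
    return (scores >= threshold).astype(float)

def forward(x, W, scores):
    """Linear layer using masked weights W * mask; W itself is never updated."""
    return x @ (W * binarize(scores))

def score_grad(x, grad_out, W):
    """Straight-through gradient w.r.t. the scores: binarize is treated as identity.
    dLoss/d(masked W) = x^T grad_out, and d(masked W)/d(mask) = W."""
    return (x.T @ grad_out) * W

# One toy gradient step on a squared-error loss.
x = rng.normal(size=(5, 4))
target = rng.normal(size=(5, 3))
out = forward(x, W, scores)
grad_out = 2.0 * (out - target) / len(x)

W_before = W.copy()
scores -= 0.1 * score_grad(x, grad_out, W)  # only the scores change
mask = binarize(scores)
```

After training, only `mask` needs to be stored per task; the pretrained `W` is shared across all tasks.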

Paper Structure

This paper contains 35 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Dev set performance of masking BERT when selecting different amounts of pretrained parameters.
  • Figure 2: The impact of masking different transformer blocks of BERT for MRPC (left), CoLA (middle), and RTE (right). The x-axis shows the number of masked blocks, which are selected either "bottom-up" or "top-down". More precisely, a bottom-up setup (red) masking 4 blocks masks transformer blocks $\{ 0,1,2,3 \}$; a top-down setup (blue) masking 4 blocks masks transformer blocks $\{ 8,9,10,11 \}$. $\mathbf{W}_{P}$ and $\mathbf{W}_{T}$ are always masked.
  • Figure 3: The accumulated number of parameters and memory required by finetuning and masking to solve an increasing number of tasks.
  • Figure 4: t-SNE visualization of the representation of [CLS] computed by the topmost transformer block in pretrained (left), finetuned (top right), and masked (bottom right) BERT/RoBERTa. We use scikit-learn with default t-SNE parameters.
  • Figure 5: Scores $s$ of two sets of masks, trained with two different tasks, of layer $\mathbf{W}_O$ in transformer blocks 2 (left) and 11 (right) in BERT. A large $s$ means that the two masks are dissimilar.
  • ...and 2 more figures
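
The memory comparison behind Figure 3 can be approximated with simple arithmetic. The sketch below is a rough accounting under stated assumptions, not the paper's exact numbers: weights are stored as 32-bit floats, each mask entry takes 1 bit, every parameter is masked, task-specific output layers are ignored, and `NUM_PARAMS` is a hypothetical model size of roughly 110 million parameters.

```python
def finetuning_bytes(num_tasks, num_params, bytes_per_weight=4):
    """Finetuning keeps one full copy of the weights per task."""
    return num_tasks * num_params * bytes_per_weight

def masking_bytes(num_tasks, num_params, num_masked, bytes_per_weight=4):
    """Masking keeps one shared copy of the weights plus a 1-bit mask per task."""
    shared = num_params * bytes_per_weight
    masks = num_tasks * ((num_masked + 7) // 8)  # bits rounded up to whole bytes
    return shared + masks

NUM_PARAMS = 110_000_000  # assumption: illustrative model size

for k in (1, 2, 4, 8, 16):
    ft = finetuning_bytes(k, NUM_PARAMS) / 1e6
    mk = masking_bytes(k, NUM_PARAMS, NUM_PARAMS) / 1e6
    print(f"{k:2d} tasks: finetuning {ft:8.1f} MB, masking {mk:8.1f} MB")
```

Under these assumptions masking costs slightly more than finetuning for a single task, because the mask is stored on top of the shared weights, and becomes cheaper from the second task onward.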