Dynamic Masking Rate Schedules for MLM Pretraining

Zachary Ankner; Naomi Saphra; Davis Blalock; Jonathan Frankle; Matthew L. Leavitt

TL;DR

This paper replaces the fixed masking rate used in MLM pretraining with a dynamic schedule that decreases the masking rate over the course of training. The authors define a linear schedule from an initial rate to a final rate, apply it to BERT-base and BERT-large pretrained on C4, and evaluate on GLUE. Scheduling consistently improves average GLUE accuracy and substantially speeds up pretraining, yielding Pareto improvements over fixed-rate baselines. Key findings: the schedule must start at a higher masking rate and decay; decreasing schedules outperform increasing ones; and the approach generalizes to other objectives such as RTS and to grammar benchmarks such as BLiMP. The method is simple, improves both linguistic and masked-language-modeling performance, and offers practical gains in training efficiency; multilingual and encoder-decoder settings are left to future work.

Abstract

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
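
The linear schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `total_steps` value, and the per-token Bernoulli masking are assumptions made for the example, and the 0.30 to 0.15 defaults are taken from the linear-0.3-0.15 configuration named in the Figure 1 caption. The sketch only selects which positions to mask; it does not reproduce BERT's token replacement rules.

```python
import random


def linear_masking_rate(step, total_steps, p_init=0.30, p_final=0.15):
    """Linearly interpolate the masking rate from p_init (step 0) to p_final (last step).

    The names and defaults are illustrative assumptions; 0.30 -> 0.15 mirrors
    the "linear-0.3-0.15" label in the Figure 1 caption.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay within the schedule.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_init + frac * (p_final - p_init)


def sample_mask(num_tokens, rate, rng):
    """Select each position for masking independently with probability `rate`.

    Simplifying assumption: plain Bernoulli selection per token.
    """
    return [rng.random() < rate for _ in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 1000  # hypothetical training length for the demo
    for step in (0, 500, 1000):
        rate = linear_masking_rate(step, total_steps)
        mask = sample_mask(128, rate, rng)
        print(f"step={step} rate={rate:.3f} masked={sum(mask)}/128")
```

Swapping the default arguments gives an increasing schedule, and setting `p_init == p_final` recovers the constant baseline, so one function covers the constant and linear schedulers listed in the outline.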

Paper Structure (35 sections, 6 equations, 5 figures, 9 tables)

Introduction
Methods
Masked language modeling
Schedulers
Constant scheduling.
Linear scheduling.
Experiments and Results
Improvement in downstream tasks
Improvement in training efficiency
High to low, not low to high
Masking and loss are both necessary for improved performance
Improvement in grammar capabilities
Improvement in the pretraining objective
Related work
Masked Language Modeling
...and 20 more sections

Figures (5)

Figure 1: Average GLUE accuracy evaluated over the course of pretraining for BERT-base. The horizontal lines correspond to the difference in steps required for linear-0.3-0.15 to achieve the best constant schedule performance.
Figure 2: Average GLUE accuracy evaluated over the course of pretraining for BERT-large.
Figure 3: Pretraining step vs interpolated average GLUE accuracy for BERT-base.
Figure 4: Various masking rate schedules we considered. Schedules can be constant, increasing or decreasing, and change following a linear, cosine, or step function.
Figure 5: Pretraining step vs interpolated average GLUE accuracy for RTS with BERT-base.
