Dynamic Masking Rate Schedules for MLM Pretraining

Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt

TL;DR

This paper addresses the fixed masking rate used in MLM pretraining by introducing a masking-rate schedule that decreases the proportion of masked tokens over the course of training. The authors define a linear schedule from an initial rate to a final rate, apply it to BERT-base and BERT-large pretrained on C4, and evaluate on GLUE. The scheduled models consistently improve average GLUE accuracy and reach baseline quality in substantially fewer pretraining steps, giving Pareto improvements over fixed-rate baselines. Key findings: the gains require starting at a higher masking rate and decaying it; decreasing schedules outperform increasing ones; and the approach generalizes to other objectives such as RTS and to grammar benchmarks such as BLiMP. The method is simple, improves both linguistic and masked language modeling performance, and makes pretraining more efficient. Multilingual and encoder-decoder settings remain open for future work.

Abstract

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
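The linear schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clamping of progress, and the simplified masking step (every selected token is replaced by a mask token, without BERT's usual 80/10/10 replacement split) are assumptions made for illustration. The default rates 0.30 and 0.15 follow the linear-0.3-0.15 schedule named in the Figure 1 caption.

```python
import random


def linear_masking_rate(step, total_steps, initial_rate=0.30, final_rate=0.15):
    """Linearly interpolate the masking rate from initial_rate to final_rate.

    Hypothetical helper: names and signature are illustrative only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_rate + (final_rate - initial_rate) * progress


def mask_tokens(tokens, rate, mask_token="[MASK]", rng=None):
    """Replace each token with mask_token independently with probability `rate`.

    Simplification (assumption): no 80/10/10 replacement split.
    """
    rng = rng or random.Random()
    return [mask_token if rng.random() < rate else tok for tok in tokens]


if __name__ == "__main__":
    total_steps = 10
    sentence = "the quick brown fox jumps over the lazy dog".split()
    rng = random.Random(0)
    for step in (0, 5, 10):
        rate = linear_masking_rate(step, total_steps)
        print(step, round(rate, 4), mask_tokens(sentence, rate, rng=rng))
```

Under these assumptions the rate is 0.30 at step 0 and 0.15 at the final step, so early batches hide more tokens than late ones. In a real pretraining loop the rate would be recomputed before each batch is masked.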

Paper Structure

This paper contains 35 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Average GLUE accuracy evaluated over the course of pretraining for BERT-base. The horizontal lines correspond to the difference in steps required for linear-0.3-0.15 to achieve the best constant schedule performance.
  • Figure 2: Average GLUE accuracy evaluated over the course of pretraining for BERT-large.
  • Figure 3: Pretraining step vs interpolated average GLUE accuracy for BERT-base.
  • Figure 4: Various masking rate schedules we considered. Schedules can be constant, increasing or decreasing, and change following a linear, cosine, or step function.
  • Figure 5: Pretraining step vs interpolated average GLUE accuracy for RTS with BERT-base.
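The schedule families named in the Figure 4 caption (constant, linear, cosine, step) can be sketched as functions of training progress. The exact parameterizations are not given in this summary, so the formulas below are assumptions: a half-cosine interpolation between the two rates, and a single step that switches rates at a hypothetical midpoint. The function name and arguments are illustrative as well.

```python
import math


def masking_rate(progress, kind, start=0.30, end=0.15, switch_at=0.5):
    """Return a masking rate for training progress in [0, 1].

    Hypothetical sketch: `kind` is one of "constant", "linear", "cosine",
    "step". Setting start > end gives a decreasing schedule and start < end
    an increasing one. `switch_at` is an assumed step location.
    """
    progress = min(max(progress, 0.0), 1.0)
    if kind == "constant":
        return start
    if kind == "linear":
        return start + (end - start) * progress
    if kind == "cosine":
        # Half-cosine interpolation: equals start at 0 and end at 1.
        return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
    if kind == "step":
        return start if progress < switch_at else end
    raise ValueError(f"unknown schedule kind: {kind}")


if __name__ == "__main__":
    for kind in ("constant", "linear", "cosine", "step"):
        print(kind, [round(masking_rate(p / 4, kind), 4) for p in range(5)])
```

Swapping the `start` and `end` arguments turns any of the decreasing schedules into its increasing counterpart, which is the comparison the caption refers to.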