Dynamic Masking Rate Schedules for MLM Pretraining
Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt
TL;DR
This paper addresses the limitation of a fixed masking rate in MLM pretraining by introducing a dynamic schedule that decreases the masking rate over the course of training. The authors define a linear schedule from an initial to a final rate, apply it to BERT-base and BERT-large pretrained on C4, and evaluate on GLUE. The scheduled models show consistent improvements in average GLUE accuracy and substantial pretraining speedups, yielding Pareto improvements over fixed-rate baselines. Three findings stand out: the schedule must start at a higher masking rate and decay it, decreasing schedules outperform increasing ones, and the approach generalizes to other objectives such as Random Token Substitution (RTS) and to grammar benchmarks such as BLiMP. The method is simple, improves both linguistic and masked language modeling performance, and offers practical gains in training efficiency, with scope for future work in multilingual and encoder-decoder settings.
Abstract
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
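The linearly decreasing schedule described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the clamping behavior, and the default rates of 0.30 and 0.15 are assumptions chosen for illustration. The sketch only shows the idea of interpolating the masking rate from an initial to a final value as training progresses, then using the current rate to decide which token positions to mask.

```python
import random


def linear_masking_rate(step, total_steps, initial_rate=0.30, final_rate=0.15):
    """Interpolate linearly from initial_rate to final_rate over training.

    The default rates are illustrative assumptions, not values taken from
    the text above.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1] (an assumption).
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_rate + (final_rate - initial_rate) * progress


def sample_mask(num_tokens, rate, rng):
    """Mark each token position for masking independently with probability `rate`."""
    return [rng.random() < rate for _ in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 1000
    for step in (0, 500, 1000):
        rate = linear_masking_rate(step, total_steps)
        mask = sample_mask(128, rate, rng)
        print(f"step={step} rate={rate:.3f} masked={sum(mask)}/128")
```

Under this sketch, a fixed-rate baseline is the special case where `initial_rate` equals `final_rate`, and an increasing schedule is obtained by swapping the two values.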
