MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle

TL;DR

This paper addresses the high cost of pretraining BERT‑style encoders. It presents MosaicBERT, a BERT‑style architecture that combines throughput‑oriented components (FlashAttention, ALiBi, GLU) with efficiency changes to the architecture and training recipe (unpadding, bf16 LayerNorm, a 30% MLM masking ratio). The authors compare MosaicBERT against a strong BERT baseline using accuracy‑versus‑pretraining‑time Pareto curves on GLUE and report that MosaicBERT is Pareto optimal in both base and large configurations. MosaicBERT‑Base attains an average GLUE (dev) score of 79.6 in 1.13 hours on 8×A100‑80GB (cost ≈ $20), while the BERT‑Base baseline attains 83.2 in 11.5 hours. The work offers practical guidance for fast, cost‑effective pretraining and releases code and weights so that researchers can build domain‑specific encoders.
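The accuracy-versus-time Pareto analysis mentioned above can be illustrated with a minimal sketch. It assumes each run is summarized by a (label, pretraining hours, average GLUE score) tuple; the function name `pareto_front` and the two runs labeled "hypothetical" are illustrative and not from the paper. Only the MosaicBERT-Base and BERT-Base points are taken from the summary above.

```python
from typing import List, Tuple

# (label, pretraining hours, average GLUE dev score)
Run = Tuple[str, float, float]


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run dominates.

    A run is dominated when another run is at least as fast and at least
    as accurate, and strictly better on one of the two axes.
    Exact duplicates are kept once.
    """
    front: List[Run] = []
    best_score = float("-inf")
    # Visit runs from fastest to slowest; on equal time, best score first.
    for run in sorted(runs, key=lambda r: (r[1], -r[2])):
        # A slower run stays on the front only if it beats every faster run.
        if run[2] > best_score:
            front.append(run)
            best_score = run[2]
    return front


runs: List[Run] = [
    ("MosaicBERT-Base", 1.13, 79.6),   # from the summary above
    ("BERT-Base", 11.5, 83.2),         # from the summary above
    ("hypothetical-A", 2.0, 78.0),     # made-up point, dominated
    ("hypothetical-B", 6.0, 82.0),     # made-up point, on the front
]

for label, hours, score in pareto_front(runs):
    print(f"{label}: {score} GLUE in {hours} h")
```

Note that the two reported points alone do not dominate each other (one is faster, the other more accurate), which is why the paper relies on full curves across many checkpoints rather than a single pair of numbers.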

Abstract

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.
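The idea behind "a module to dynamically remove padded tokens" can be sketched in plain Python. This is not the authors' implementation: the function names `unpad` and `repad`, the pad id 0, and the token ids are all assumptions chosen for illustration, and the sketch assumes right-padded sequences. It concatenates the real tokens of a batch into one flat sequence and records cumulative sequence boundaries, so that no compute would be spent on padding positions.

```python
from typing import List, Tuple


def unpad(input_ids: List[List[int]],
          attention_mask: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten a padded batch, keeping only positions where the mask is 1.

    Returns the flat token list and cumulative sequence lengths, where
    sequence k occupies tokens[cu_seqlens[k]:cu_seqlens[k + 1]].
    """
    tokens: List[int] = []
    cu_seqlens: List[int] = [0]
    for ids, mask in zip(input_ids, attention_mask):
        kept = [t for t, m in zip(ids, mask) if m == 1]
        tokens.extend(kept)
        cu_seqlens.append(cu_seqlens[-1] + len(kept))
    return tokens, cu_seqlens


def repad(tokens: List[int], cu_seqlens: List[int],
          max_len: int, pad_id: int = 0) -> List[List[int]]:
    """Inverse of unpad for right-padded batches (pad_id is an assumption)."""
    batch: List[List[int]] = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = tokens[start:end]
        batch.append(seq + [pad_id] * (max_len - len(seq)))
    return batch


# Illustrative batch: two sequences padded to length 5 (token ids are made up).
input_ids = [[101, 7592, 102, 0, 0],
             [101, 2088, 2003, 2307, 102]]
attention_mask = [[1, 1, 1, 0, 0],
                  [1, 1, 1, 1, 1]]

flat, cu = unpad(input_ids, attention_mask)
print(len(flat), cu)  # 8 real tokens instead of 10 padded positions
```

In this toy batch the saving is 2 of 10 positions; the benefit in practice depends on how much padding the batches contain.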

Paper Structure (31 sections, 3 equations, 11 figures, 6 tables)

This paper contains 31 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: (A) Schematic of MosaicBERT architecture (B) Pareto curves of average GLUE (dev) scores for MosaicBERT-Base and the standard BERT-Base. Error bars indicate 95% confidence interval over n=5 pretraining seeds. All training was on $8\times$A100-80GB GPUs. (FlashAttention schematic adapted from Dao et al., 2022; unpadding schematic adapted from Zeng et al., 2022.)
  • Figure 2: Performance on individual GLUE (dev) finetuning tasks. Our MosaicBERT-Base consistently outperforms BERT-Base on MNLI-m, QNLI, QQP and RTE, and has comparable performance on CoLA, SST-2, MRPC and STSB. Wall clock time is for $8\times$A100-80GB GPUs, and does not include finetuning time. Error bars are plotted with 95% confidence interval across $n=5$ pretraining seeds, and all models are trained for 70,000 steps with batch size 4096.
  • Figure 3: Average GLUE (dev) score Pareto curves for MosaicBERT-Base and Large trained for roughly 2 epochs of C4 (i.e. 178,000 steps with batch size 4096 with maximum sequence length 128 tokens). MosaicBERT-Base and Large are Pareto optimal relative to BERT-Base and Large. All pretraining is done on $8\times$A100 80GB devices (n=2-3 pretraining seeds). Note that BERT-Base and MosaicBERT-Base took much less time to train than BERT-Large and MosaicBERT-Large.
  • Figure 4: Ablation Experiments (A) Average GLUE score and (B) Individual GLUE tasks. BERT-base: standard BERT-base (110M parameters) with attention dropout=0.1 and feedforward dropout=0.1, vocab size set to 30522, MLM=15% (all Hugging Face standard configurations). BERT+drpt=0: standard BERT-base, except that the dropout in the attention layer is set to 0 instead of the default 0.1. BERT+GLU: standard BERT-base, with GLU for the feedforward component of the encoder block. BERT+lpLN: standard BERT-base, except with low precision LayerNorm (bfloat16). BERT+mlm30: standard BERT-base, except with a masked language modeling masking ratio of 30%. MosaicBERT: the complete MosaicBERT-Base including GLU (where the dimension of the intermediate layer is 3072, resulting in 137M total parameters), ALiBi, low precision LayerNorm, unpadding, MLM 30%, vocab size 30528 (a multiple of 64), and attention dropout=0. MosaicBERT-FlashAttn+drpt=0.1: MosaicBERT-Base without FlashAttention and with the attention dropout set to 0.1.
  • Figure S1: Pareto curves for BERT-Base and MosaicBERT-Base for runs trained for 70,000 and 178,000 steps with batch size 4096 and sequence length 128. Same data as Figure 1B and Figure 3.
  • ...and 6 more figures
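
The Figure 4 caption lists a vocabulary size of 30522 for the standard BERT configuration and 30528 (a multiple of 64) for MosaicBERT. The arithmetic behind that choice can be checked with a short sketch; the helper name `round_up_to_multiple` is hypothetical and not taken from the released code.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


bert_vocab = 30522  # standard BERT vocab size, per the Figure 4 caption
padded_vocab = round_up_to_multiple(bert_vocab, 64)

print(padded_vocab)               # 30528
print(padded_vocab - bert_vocab)  # 6 extra, unused embedding rows
```

Rounding up adds only six unused rows to the embedding table, which is a negligible parameter cost for dimensions that align with GPU-friendly sizes.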