BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

Wen Liang; Youzhi Liang

TL;DR

BPDec embeds a pretraining-time MLM decoder into the vanilla BERT pipeline while keeping the encoder unchanged. It introduces Gradual Unmasking Attention and a random encoder–decoder output mix to enrich representations during pretraining, yielding improvements on GLUE and SQuAD without increasing fine-tuning or serving costs. The approach shows competitive performance relative to DeBERTa and generalizes across several BERT-like models, indicating broad applicability. This work offers a practical, efficiency-conscious path to leverage enhanced masked language modeling through a decoder, with potential environmental benefits and avenues for future extension to other architectures.
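The random encoder-decoder output mix can be pictured with a small sketch. This summary does not specify the mixing rule, so the per-position coin flip below, the `p_encoder` probability, and the function name `random_mix` are all illustrative assumptions rather than the authors' implementation.

```python
import random


def random_mix(encoder_out, decoder_out, p_encoder=0.5, seed=0):
    """Pick, per token position, either the encoder or the decoder vector.

    ASSUMPTION: position-wise Bernoulli selection with probability
    `p_encoder`; the actual BPDec mixing rule may differ.
    """
    if len(encoder_out) != len(decoder_out):
        raise ValueError("encoder and decoder outputs must have equal length")
    rng = random.Random(seed)
    mixed, from_encoder = [], []
    for enc_vec, dec_vec in zip(encoder_out, decoder_out):
        use_encoder = rng.random() < p_encoder
        mixed.append(list(enc_vec if use_encoder else dec_vec))
        from_encoder.append(use_encoder)
    return mixed, from_encoder


# Toy hidden states: 4 token positions, hidden size 2 (made-up values).
encoder_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
decoder_out = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]]
mixed, from_encoder = random_mix(encoder_out, decoder_out)
```

Under this assumed scheme, some positions reach the MLM head directly from the encoder, so the encoder still receives a direct training signal and can be used alone once the decoder is discarded for fine-tuning.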

Abstract

BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks. Yet most researchers have concentrated on enhancements to the model structure, such as relative position embedding and more efficient attention mechanisms. Others have explored pretraining tricks associated with Masked Language Modeling (MLM), including whole word masking. DeBERTa introduced an enhanced decoder adapted for BERT's encoder model for pretraining, which proved highly effective. We argue that the design of, and research around, enhanced masked language modeling decoders have been underappreciated. In this paper, we propose several designs of enhanced decoders and introduce BPDec (BERT Pretraining Decoder), a novel method for pretraining. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder and change only the decoder, leaving the encoder unaltered. This approach does not require extensive modifications to the encoder architecture and can be seamlessly integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Like other methods, ours incurs a moderate training cost for the decoder during pretraining, but it introduces no additional training cost during the fine-tuning phase. We test multiple enhanced decoder structures after pretraining and evaluate their performance on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, which makes only subtle refinements to the model structure during pretraining, significantly enhances model performance without increasing fine-tuning cost, inference time, or serving budget.

Paper Structure (18 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Works
Masked Language Modeling
Finetuning
Methods
MLM Decoder
Gradual Unmasking Attention (GUA)
Random Mix of Encoder and Decoder Outputs
Results
Pretraining
Performance on GLUE Tasks
Performance on SQuAD Tasks
Ablation Study
Number of MLM Decoder Layers
Decoder Layers with Gradual Unmasking Attention (GUA)
...and 3 more sections

Figures (4)

Figure 1: Examples of attention heads with and without attention masks. (a) Attention heads will not attend to the masked embedding highlighted in red due to the attention mask. (b) The attention mask is disabled.
Figure 2: Examples of gradual relaxation of attention mask. (a) Encoder layers are unaffected. (b) Randomly unmask a portion of the masked positions. (c) In the final layer of the BPDec, we unmask the rest of the attention mask and fully disable the attention mask.
Figure 3: Random encoder-decoder output mixing.
Figure 4: Attention heatmap comparing masked attention and our GUA mechanism in the last layer of a pretrained BERT+BPDec model.
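The gradual relaxation described in the Figure 2 caption can be sketched as a per-layer schedule over the blocked key positions. The caption only states that a portion of the masked positions is randomly unmasked and that the final BPDec layer disables the attention mask entirely; the fixed `unmask_fraction`, the cumulative unmasking across layers, and the function name `gradual_unmask` are assumptions made for illustration.

```python
import random


def gradual_unmask(blocked, num_decoder_layers, unmask_fraction=0.5, seed=0):
    """Return one key-blocking mask per decoder layer.

    blocked: list of bools, True where attention to that key position is
    blocked (e.g. a [MASK] token position in the encoder layers).
    ASSUMPTION: each non-final layer opens a random `unmask_fraction` of the
    positions that are still blocked; the final layer opens all of them.
    """
    rng = random.Random(seed)
    current = list(blocked)
    masks = []
    for layer in range(num_decoder_layers):
        still_blocked = [i for i, b in enumerate(current) if b]
        if layer == num_decoder_layers - 1:
            to_open = still_blocked  # final layer: mask fully disabled
        else:
            k = round(unmask_fraction * len(still_blocked))
            to_open = rng.sample(still_blocked, k)
        for i in to_open:
            current[i] = False
        masks.append(list(current))
    return masks


# Six key positions, four of them blocked in the encoder layers.
encoder_mask = [False, True, False, True, True, True]
for layer, mask in enumerate(gradual_unmask(encoder_mask, num_decoder_layers=3)):
    print(f"decoder layer {layer}: {sum(mask)} positions still blocked")
```

In a real model, each returned mask would be converted to additive attention biases for the corresponding decoder layer, while the encoder layers keep using the original mask unchanged.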
