GPT or BERT: why not both?

Lucas Georges Gabriel Charpentier, David Samuel

TL;DR

GPT-BERT presents a simple, unified approach to combine masked language modeling and causal language modeling within a single transformer. By shifting MLM outputs to align with next-token predictions, the model can operate in MLM, CLM, or prefix modes without architectural changes. Across BabyLM benchmarks, the hybrid objective improves performance relative to single-objective baselines and enables in-context learning signals in compact models. The work demonstrates that a 1:15 causal-to-masked data ratio, along with targeted modifications (attention gate, layer weighting, and scheduling strategies), yields robust, versatile language representations with efficient training. This suggests that merging modeling paradigms can enhance generalization and practical applicability in low-resource settings.
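
The shifting idea above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' released code: the function names, the `IGNORE` label value, and the `MASK_ID` token id are all hypothetical. It shows how one token sequence could yield a causal example, a masked example whose targets are shifted to the next-token position, and a prefix attention mask.

```python
IGNORE = -100  # hypothetical label meaning "no loss at this position"
MASK_ID = 0    # hypothetical id of the mask token


def causal_example(tokens):
    """Next-token prediction with a lower-triangular attention mask."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    n = len(inputs)
    # Position i may attend to positions j <= i only.
    attn = [[j <= i for j in range(n)] for i in range(n)]
    return inputs, targets, attn


def masked_next_token_example(tokens, masked_positions, mask_id=MASK_ID):
    """Masked prediction with outputs shifted one step to the right.

    The token hidden at position k + 1 is predicted from the output at
    position k, so the targets line up with the causal case.
    """
    masked = set(masked_positions)
    corrupted = [mask_id if i in masked else t for i, t in enumerate(tokens)]
    inputs = corrupted[:-1]
    targets = [
        tokens[i + 1] if (i + 1) in masked else IGNORE
        for i in range(len(inputs))
    ]
    n = len(inputs)
    # Fully bidirectional attention: every position sees every other.
    attn = [[True] * n for _ in range(n)]
    return inputs, targets, attn


def prefix_attention_mask(n, prefix_len):
    """Bidirectional attention over the prefix, causal attention after it."""
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]
```

In this sketch both example types share the same input and target layout, so a single model could consume either one; only the corrupted inputs, the positions that carry a loss, and the attention mask differ. A token masked at position 0 has no preceding output to predict it from, so a real pipeline would presumably prepend a start-of-sequence token.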

Abstract

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.

Paper Structure

This paper contains 48 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Two modes of a single model. Causal and masked language modeling can be easily unified by shifting both outputs by one token to the right. We can then train one language model on both paradigms at the same time just by modifying the input tokens, output tokens and attention masks.
  • Figure 2: The effect of the causal-to-mask ratio. Comparison of performance on different tasks when varying the ratio of MNTP used during pre-training. We also look at the performance of the model using prefix language modeling with a partially-bidirectional attention mask. MNLI scores are reported with standard deviation error bars estimated by averaging the variations across three finetuning random seeds.
  • Figure 3: SST-2 in-context learning. 20-shot ICL results on the SST-2 validation set for models trained on the 100M BabyLM datasets with varying degrees of each objective. The demonstrations (shots) were chosen at random from the training dataset. We perform 20 runs and report the mean and standard deviation. Note that the accuracy of the majority baseline on this dataset is 51.8%.
  • Figure 4: BLiMP-Supplement accuracy. Comparison of BLiMP-Supplement accuracy when varying the ratio of MNTP used during pre-training. We set the temperature applied to the logits to 1 for a fair comparison between the evaluation strategies. Fused is the sum of the logits from the causal and masked evaluation.
  • Figure 5: EWoK accuracy. Comparison of EWoK accuracy when varying the ratio of MNTP used during pre-training. We set the temperature applied to the logits to 1 for a fair comparison between the evaluation strategies. Fused is the sum of the logits from the causal and masked evaluation. We also look at the performance of the model using a prefix masking strategy where the whole context is visible to the model.
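
The "Fused" strategy named in the captions for Figures 4 and 5 is described only as the sum of the logits from the causal and masked evaluations, with the temperature set to 1. A minimal sketch of that summation follows; the function name `fuse_logits` and the choice to apply the temperature to each logit vector before summing are assumptions, and how the fused scores are aggregated into a sentence-level decision is not specified here.

```python
def fuse_logits(causal_logits, masked_logits, temperature=1.0):
    """Element-wise sum of two logit vectors over the same vocabulary.

    Assumption: the temperature divides each logit vector before the sum.
    With temperature = 1, as in the captions, this is a plain sum.
    """
    if len(causal_logits) != len(masked_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        c / temperature + m / temperature
        for c, m in zip(causal_logits, masked_logits)
    ]


# Toy three-entry vocabulary with made-up logits from the two modes.
fused = fuse_logits([1.0, 2.0, 0.5], [0.5, -1.0, 2.0])
best = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the causal logits alone favor index 1, while the fused logits favor index 2, which illustrates how the two evaluation modes can disagree and how the sum arbitrates between them.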