Inconsistencies in Masked Language Models

Tom Young, Yunan Chen, Yang You

TL;DR

This paper shows that the distributions an MLM produces under different masking patterns can be considerably inconsistent, i.e., they cannot be derived from a coherent joint distribution when considered together, and proposes an inference-time strategy for MLMs called Ensemble of Conditionals.

Abstract

Learning to predict masked tokens in a sequence has been shown to be a helpful pretraining objective for powerful language models such as PaLM2. After training, such masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence. However, this paper shows that distributions corresponding to different masking patterns can demonstrate considerable inconsistencies, i.e., they cannot be derived from a coherent joint distribution when considered together. This fundamental flaw in MLMs can lead to self-contradictory behaviors during inference. On various benchmark datasets including MMLU, MLMs can give different predictions to the same input question. From BERT-base to UL2-20B, we show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and configurations. In light of our observations, we further propose an inference-time strategy for MLMs called Ensemble of Conditionals. It jointly considers a selected range of inconsistent conditionals directly produced by the MLM for the final prediction, which often leads to considerable accuracy improvement.
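
To make "cannot be derived from a coherent joint distribution" concrete, consider two candidate first tokens and two candidate second tokens. If the conditionals P(first | second) and P(second | first) all come from one joint distribution, the four conditional ratios taken around the 2x2 cycle multiply to exactly 1, because each ratio equals a ratio of joint probabilities and those cancel. The sketch below checks this identity. The function names, token labels, and all numbers are illustrative assumptions, not values or code from the paper.

```python
from itertools import product


def conditionals_from_joint(joint):
    """Derive P(first | second) and P(second | first) from a joint over bigrams.

    Both returned dicts are keyed by (first, second).
    """
    firsts = sorted({a for a, _ in joint})
    seconds = sorted({b for _, b in joint})
    first_given_second = {
        (a, b): joint[(a, b)] / sum(joint[(a2, b)] for a2 in firsts)
        for a, b in product(firsts, seconds)
    }
    second_given_first = {
        (a, b): joint[(a, b)] / sum(joint[(a, b2)] for b2 in seconds)
        for a, b in product(firsts, seconds)
    }
    return first_given_second, second_given_first


def cycle_product(first_given_second, second_given_first, a1, a2, b1, b2):
    """Multiply the four conditional ratios around the 2x2 cycle.

    Equals 1 whenever both sets of conditionals share one joint distribution.
    """
    f, s = first_given_second, second_given_first
    return (
        (f[(a1, b1)] / f[(a2, b1)])    # = J(a1,b1) / J(a2,b1)
        * (s[(a2, b1)] / s[(a2, b2)])  # = J(a2,b1) / J(a2,b2)
        * (f[(a2, b2)] / f[(a1, b2)])  # = J(a2,b2) / J(a1,b2)
        * (s[(a1, b2)] / s[(a1, b1)])  # = J(a1,b2) / J(a1,b1)
    )


# Consistent case: conditionals derived from a single (made-up) joint.
joint = {("x", "p"): 0.4, ("x", "q"): 0.1, ("y", "p"): 0.2, ("y", "q"): 0.3}
f_ok, s_ok = conditionals_from_joint(joint)
consistent = cycle_product(f_ok, s_ok, "x", "y", "p", "q")

# Inconsistent case: hand-written (made-up) conditionals, each properly
# normalized on its own, that no single joint distribution can produce.
f_bad = {("x", "p"): 0.9, ("y", "p"): 0.1, ("x", "q"): 0.2, ("y", "q"): 0.8}
s_bad = {("x", "p"): 0.3, ("x", "q"): 0.7, ("y", "p"): 0.6, ("y", "q"): 0.4}
inconsistent = cycle_product(f_bad, s_bad, "x", "y", "p", "q")

print(f"consistent cycle product:   {consistent:.6f}")
print(f"inconsistent cycle product: {inconsistent:.6f}")
```

Because only ratios of probabilities conditioned on the same context enter the product, the check does not depend on how each conditional is normalized. In principle it can therefore be applied to probabilities read off a full-vocabulary MLM output, although how the paper measures inconsistency in practice is not specified in this section.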

Paper Structure (14 sections, 7 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Self-ensembling improves MLMs' accuracies on standard benchmarks including MMLU, Lambada and BigBench. Aggregated results based on Figure \ref{fig:eoc_accuracy}.
  • Figure 2: A simple bigram comparison example that exposes the inconsistencies in the T5 model. The conditional probabilities that the model learned (quoted from T5-11B fed with the shown masked sequences) contradict each other greatly. Not only are the ratios unbalanced, but the model also contradicts its own preference between the two bigrams.
  • Figure 3: K-offset and Multimask patterns. The goal here is to prompt the MLM for different versions of the target token distribution. The red token is our target token. The coral tokens are taken from the original input sequence and fed as starting tokens to the decoder of the MLM.
  • Figure 4: Different conditionals disagree on the prediction to make.
  • Figure 5: EOC improves MLM accuracy.
  • ...and 1 more figure
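
The idea behind Figures 4 and 5, that different conditionals for the same target token can disagree and that combining them can help, can be illustrated with a small sketch. The aggregation rule used here (an unweighted arithmetic mean of the candidate probabilities) is an assumption made for illustration. This section does not state how Ensemble of Conditionals selects or combines conditionals, and the function name and numbers below are hypothetical.

```python
def ensemble_of_conditionals(distributions):
    """Average several distributions over the same candidate answers.

    ASSUMPTION: an unweighted arithmetic mean is used for illustration only;
    the paper's actual aggregation rule is not given in this section.
    Returns (best_candidate, averaged_distribution).
    """
    candidates = list(distributions[0].keys())
    averaged = {
        c: sum(d[c] for d in distributions) / len(distributions)
        for c in candidates
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Made-up distributions over answer options for one question, standing in for
# what an MLM might return under three different masking patterns.
conditionals = [
    {"A": 0.40, "B": 0.35, "C": 0.25},
    {"A": 0.30, "B": 0.45, "C": 0.25},
    {"A": 0.20, "B": 0.50, "C": 0.30},
]

# Each conditional's own top choice; here they do not all agree.
individual_predictions = [max(d, key=d.get) for d in conditionals]
best, averaged = ensemble_of_conditionals(conditionals)

print("individual predictions:", individual_predictions)
print("ensemble prediction:", best)
```

In this toy input the first conditional prefers a different option from the other two, and the averaged distribution resolves the disagreement into a single prediction.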