Inconsistencies in Masked Language Models

Tom Young; Yunan Chen; Yang You

TL;DR

This paper shows that the distributions an MLM produces for different masking patterns can be considerably inconsistent, i.e., they cannot all be derived from a single coherent joint distribution, and proposes an inference-time strategy for MLMs called Ensemble of Conditionals.

Abstract

Learning to predict masked tokens in a sequence has been shown to be a helpful pretraining objective for powerful language models such as PaLM2. After training, such masked language models (MLMs) can provide distributions over tokens at the masked positions in a sequence. However, this paper shows that the distributions corresponding to different masking patterns can be considerably inconsistent, i.e., they cannot be derived from a coherent joint distribution when considered together. This fundamental flaw in MLMs can lead to self-contradictory behaviors during inference. On various benchmark datasets including MMLU, MLMs can give different predictions for the same input question. From BERT-base to UL2-20B, we show that such inconsistencies exist ubiquitously in MLMs of diverse sizes and configurations. In light of our observations, we further propose an inference-time strategy for MLMs called Ensemble of Conditionals. It jointly considers a selected range of inconsistent conditionals directly produced by the MLM for the final prediction, which often leads to considerable accuracy improvement.
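To make the notion of inconsistency concrete, here is a minimal sketch for the two-token (bigram) case. Under any coherent joint distribution, the ratio P(a, b) / P(a2, b2) can be computed from conditionals along two different paths, and both paths must agree; conditionals that disagree cannot come from one joint. The sketch also includes one possible way to combine several conditionals for the same target token. The function names, the toy probabilities, and the log-probability averaging rule are all assumptions made for illustration; the abstract does not specify how Ensemble of Conditionals aggregates its conditionals, so this should not be read as the authors' method.

```python
import math


def conditionals_from_joint(joint):
    """Derive both conditionals from a joint over (first, second) token pairs.

    Returns two dicts:
      first_given_second[(x, y)] = P(first = x | second = y)
      second_given_first[(y, x)] = P(second = y | first = x)
    """
    first_given_second, second_given_first = {}, {}
    for (x, y), p in joint.items():
        col = sum(q for (_, y2), q in joint.items() if y2 == y)
        row = sum(q for (x2, _), q in joint.items() if x2 == x)
        first_given_second[(x, y)] = p / col
        second_given_first[(y, x)] = p / row
    return first_given_second, second_given_first


def path_ratios(first_given_second, second_given_first, a, a2, b, b2):
    """Compute P(a, b) / P(a2, b2) along two paths through the conditionals.

    Path 1 goes (a, b) -> (a, b2) -> (a2, b2).
    Path 2 goes (a, b) -> (a2, b) -> (a2, b2).
    For conditionals that come from one joint, the two values are equal.
    """
    via_a_b2 = (second_given_first[(b, a)] / second_given_first[(b2, a)]) * (
        first_given_second[(a, b2)] / first_given_second[(a2, b2)]
    )
    via_a2_b = (first_given_second[(a, b)] / first_given_second[(a2, b)]) * (
        second_given_first[(b, a2)] / second_given_first[(b2, a2)]
    )
    return via_a_b2, via_a2_b


def ensemble_of_conditionals(distributions):
    """Combine several candidate distributions for the same target token.

    ASSUMPTION: averages log-probabilities (a geometric mean) and
    renormalizes. This aggregation rule is illustrative only.
    """
    candidates = list(distributions[0].keys())
    scores = {
        c: sum(math.log(d[c]) for d in distributions) / len(distributions)
        for c in candidates
    }
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}
```

In practice the conditionals would be read off an MLM by feeding it the sequence under different mask patterns; here they are plain dictionaries so the check can be run without a model. A large gap between the two path ratios signals that the conditionals contradict each other.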

Paper Structure (14 sections, 7 equations, 6 figures, 1 table)

Introduction
Why inconsistencies can occur in MLMs
Backbone MLMs
T5-style
BERT-style
Inconsistencies in T5-style MLMs
Conditionals for various mask patterns
Exposing inconsistencies
Ensemble of Conditionals
Inconsistencies in BERT-style MLMs
Summary & Discussions
Why not use Llama in the experiments?
No. bidirectional conditionals specified by MLMs
Mask patterns

Figures (6)

Figure 1: Self-ensembling improves MLMs' accuracies on standard benchmarks including MMLU, Lambada and BigBench. Aggregated results based on the EOC accuracy figure (label fig:eoc_accuracy).
Figure 2: A simple bigram comparison example that exposes the inconsistencies in the T5 model. The conditional probabilities that the model learned (quoted from T5-11B fed with the shown masked sequences) contradict each other greatly. Not only are the ratios unbalanced, but the model also contradicts its own preference between the two bigrams.
Figure 3: K-offset and Multimask patterns. The goal here is to prompt the MLM for different versions of the target token distribution. The red token is our target token. The coral tokens are taken from the original input sequence and fed as starting tokens to the decoder of the MLM.
Figure 4: Different conditionals disagree on the prediction to make.
Figure 5: EOC improves MLM accuracy
...and 1 more figures
