Towards Probabilistically-Sound Beam Search with Masked Language Models

Creston Brooks; Robert Calef; Charlie Cowen-Breen; Anna Sappington

TL;DR

The paper tackles how to perform beam search with masked language models by addressing the absence of a readily available joint distribution $p(\mathbf{x})$ and proposing a probabilistically-sound framework based on the Hammersley-Clifford-Besag (HCB) construction. It derives conditions under which the standard MLM infilling approximation is valid and introduces an adjustment term that preserves probabilistic soundness without additional forward passes, yielding the HCB beam search method. Empirically, HCB beam search demonstrates improvements over standard beam search across multiple models and domains, while ablations reveal the importance of context, pivots, and the context signal captured by $p([\mathrm{M}] \mid \mathbf{x})$. These results enable more reliable text infilling for applications like ancient text restoration and protein engineering and offer practical guidance on pivot design and when HCB is advantageous.

Abstract

Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. However, estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering. Here we present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound inference-time modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains. Notably, our method probes the inductive biases of MLMs and explores the surprising contextual sensitivity of mask tokens for text infilling.
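As a rough illustration of the standard beam search baseline mentioned in the abstract, the sketch below fills masked positions left to right and scores each beam by the sum of log conditionals. It is not the authors' implementation: `beam_search_infill`, `toy_conditional`, and the `[M]` mask string are hypothetical names, and the toy conditional stands in for a real MLM forward pass.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[M]"  # assumed mask symbol, for illustration only

# Hypothetical stand-in for an MLM: given a partially masked sequence and a
# position, return a distribution over tokens for that position.
Conditional = Callable[[Sequence[str], int], Dict[str, float]]


def beam_search_infill(
    tokens: Sequence[str], conditional: Conditional, beam_size: int
) -> List[Tuple[float, List[str]]]:
    """Fill masked positions left to right, keeping the top `beam_size` beams.

    Each beam is scored by the sum of log p(x_i | current partially filled
    sequence), which is the usual approximation to a joint log-probability.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == MASK]
    beams: List[Tuple[float, List[str]]] = [(0.0, list(tokens))]
    for pos in positions:
        candidates: List[Tuple[float, List[str]]] = []
        for score, seq in beams:
            for tok, prob in conditional(seq, pos).items():
                if prob <= 0.0:
                    continue  # skip zero-probability tokens to avoid log(0)
                filled = list(seq)
                filled[pos] = tok
                candidates.append((score + math.log(prob), filled))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_conditional(seq: Sequence[str], pos: int) -> Dict[str, float]:
    """Toy two-token model: the prediction depends only on the left neighbor."""
    left = seq[pos - 1] if pos > 0 else None
    if left == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.6, "b": 0.4}


beams = beam_search_infill(["a", MASK, MASK], toy_conditional, beam_size=2)
best_score, best_sequence = beams[0]
```

With a real MLM, `conditional` would wrap one forward pass per beam and read off the output distribution at `pos`. The score above is only a valid joint log-probability under the conditions the paper analyzes.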

Paper Structure (29 sections, 3 theorems, 17 equations, 7 figures, 6 tables, 2 algorithms)

Introduction
Theoretical contributions.
Empirical contributions.
Text Infilling
Background and Related Work
Beam search
Challenges with MLM beam search
Constructing joint distributions from conditionals
Methods
Choosing a pivot
Experimental Setup
Models
Metrics
Datasets
Baselines
...and 14 more sections

Key Result

Theorem 1

Suppose that $p$ represents a model which achieves minimal training loss on the MLM objective. Then, on the training distribution, the learned conditionals are both compatible and satisfy the conditional independence assumption.
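Compatibility matters because, when a set of conditionals does come from a single strictly positive joint distribution, the classical Hammersley-Clifford-Besag identity recovers joint probability ratios from those conditionals alone, relative to a fixed reference sequence (a pivot). The sketch below checks that identity numerically on a small hand-built joint over three binary variables. The joint, the helper names, and the all-zeros pivot are assumptions chosen for illustration, and the paper's own notation and pivot choice may differ.

```python
import itertools
from typing import Dict, Tuple

State = Tuple[int, ...]


def make_joint() -> Dict[State, float]:
    """Arbitrary strictly positive joint over three binary variables."""
    weights = {
        s: 1.0 + s[0] + 2.0 * s[1] * s[2] + 0.5 * s[0] * s[2]
        for s in itertools.product((0, 1), repeat=3)
    }
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def conditional(joint: Dict[State, float], i: int, value: int, context: State) -> float:
    """p(x_i = value | all other coordinates as given in `context`)."""
    numerator = joint[context[:i] + (value,) + context[i + 1:]]
    denominator = sum(joint[context[:i] + (v,) + context[i + 1:]] for v in (0, 1))
    return numerator / denominator


def hcb_ratio(joint: Dict[State, float], x: State, pivot: State) -> float:
    """Compute p(x) / p(pivot) using only single-site conditionals.

    At step i the context holds x on positions before i and the pivot on
    positions after i; the ratio of conditionals swaps pivot[i] for x[i].
    """
    ratio = 1.0
    for i in range(len(x)):
        context = x[:i] + pivot[i:]
        ratio *= conditional(joint, i, x[i], context) / conditional(
            joint, i, pivot[i], context
        )
    return ratio


joint = make_joint()
pivot = (0, 0, 0)
max_error = max(
    abs(hcb_ratio(joint, x, pivot) - joint[x] / joint[pivot]) for x in joint
)
```

If the conditionals were not compatible, for example if each came from a separately perturbed model, the product would generally depend on the order in which positions are visited and would no longer define a consistent joint.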

Figures (7)

Figure 1: Overview of the proposed HCB beam search compared to standard beam search for text infilling.
Figure 2: Difference in Ithaca infilling accuracy across all choices of pivots.
Figure 3: Comparison of HCB beam search with standard beam search, nucleus sampling, and pure sampling. Evaluated on 10,000 examples from the SNLI dataset. When comparing nucleus sampling to beam search with beam size $B$, we draw $B$ samples for a fair comparison.
Figure 4: Performance of HCB beam search versus standard beam search methods across three English MLMs and DistilBERT, with beam size 20.
Figure 5: Performance of HCB beam search compared to standard beam search for various choices of pivot. All pivot experiments performed with beam size of 5 and gap size of 2 on the subset UniProt dataset containing 10,000 protein sequences. 50,000 trials were run for each pivot choice.
...and 2 more figures

Theorems & Definitions (6)

Theorem 1: Informal
Theorem 2: HCB
Theorem 3
Proof
Proof of \ref{thm:proper}
Proof that compatibility and conditional independence imply validity of \ref{eqn:approxbeam}
