CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models

Eitan Wagner, Yuli Slavutsky, Omri Abend

TL;DR

Both Masked Language Models (MLMs) and autoregressive models exhibit inconsistent predictions, with autoregressive models showing larger discrepancies. Larger MLMs tend to produce more consistent predictions, while autoregressive models show the opposite trend.

Abstract

Although language model scores are often treated as probabilities, their reliability as probability estimators has mainly been studied through calibration, overlooking other aspects. In particular, it is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans. Our work introduces a novel framework, ConTestS (Consistency Testing over Spans), involving statistical tests to assess score consistency across interchangeable completion and conditioning orders. We conduct experiments on post-release real and synthetic data to eliminate training effects. Our findings reveal that both Masked Language Models (MLMs) and autoregressive models exhibit inconsistent predictions, with autoregressive models showing larger discrepancies. Larger MLMs tend to produce more consistent predictions, while autoregressive models show the opposite trend. Moreover, for both model types, prediction entropies offer insights into the true word span likelihood and therefore can aid in selecting optimal decoding strategies. The inconsistencies revealed by our analysis, as well as their connection to prediction entropies and differences between model types, can serve as useful guides for future research on addressing these limitations.

Paper Structure (44 sections, 14 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Experimental Design - Joint Prediction Estimation with Masked Language Modeling. The middle white row displays the original unmasked tokens. Below, in blue, the joint probability is calculated by first estimating the probability of the correct token in MASK$_1$ and then of MASK$_2$ (after revealing the correct token in MASK$_1$). In the top rows, in green, the calculation is in the reversed order -- first estimating the probability of the correct token in MASK$_2$ and then in MASK$_1$ (after revealing MASK$_2$).
  • Figure 2: Discrepancy Results on the (a) Wikitext and (b) News datasets. Each model is represented by a boxplot displaying discrepancy values. MLMs appear in purple shades on the left of each figure and autoregressive models in green on the right. Color intensity indicates model sizes. Boxes show quartile values with median lines; whiskers extend to 1.5 IQR from quartiles. Outliers are omitted for clarity.
  • Figure 3: Prediction Ranking for the examined Models. Rank 1 represents the ranks for the first prediction (two masks) and Rank 2 for the second (one mask). Results were obtained from a sample size of 200 on the News dataset. Ranks are in log scale. Lower ranks indicate more accurate predictions. Boxes show quartile values with median lines; whiskers extend to 1.5 IQR from quartiles.
  • Figure 4: Entropy and Decoding Order: Correlations between four prediction entropies and the discrepancy $d_{i, i+1}$ are presented. (a) presents the two entropies associated with predicting the $i$-th token first. (b) illustrates the two entropies corresponding to predicting the $(i+1)$-th token first. For each entropy the distribution of correlations for autoregressive models is depicted in purple on the left, and for MLM models in green on the right. Dashed lines within each violin represent the first, second (i.e., median), and third quartiles, respectively.
  • Figure 5: Discrepancy Results on the Synthetic datasets.
  • ...and 2 more figures
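
The two estimation orders described in the Figure 1 caption can be sketched in a few lines. By the chain rule, a consistent estimator should give the same joint value either way: $P(x_1)\,P(x_2 \mid x_1) = P(x_2)\,P(x_1 \mid x_2)$. The sketch below is only an illustration under stated assumptions: the function names (`joint_log_prob`, `order_discrepancy`) are hypothetical, the scores are made-up numbers rather than model outputs, and the discrepancy is assumed here to be the difference between the two joint log-probability estimates, which may not match the paper's exact definition.

```python
import math


def joint_log_prob(log_p_first, log_p_second_given_first):
    """Chain-rule estimate of a two-token joint log-probability."""
    return log_p_first + log_p_second_given_first


def order_discrepancy(forward, backward):
    """Difference between the two joint log-probability estimates.

    forward:  (log P(x1 | both masked), log P(x2 | x1 revealed))
    backward: (log P(x2 | both masked), log P(x1 | x2 revealed))
    Assumed definition for illustration; zero means the orders agree.
    """
    return joint_log_prob(*forward) - joint_log_prob(*backward)


# Hypothetical scores, invented for illustration only.
# Consistent case: both orders factor the same joint, 0.2 * 0.3 = 0.1 * 0.6.
consistent = order_discrepancy(
    forward=(math.log(0.2), math.log(0.3)),
    backward=(math.log(0.1), math.log(0.6)),
)

# Inconsistent case: the orders imply joints of 0.06 and 0.03.
inconsistent = order_discrepancy(
    forward=(math.log(0.2), math.log(0.3)),
    backward=(math.log(0.1), math.log(0.3)),
)

print(f"consistent:   {consistent:.4f}")
print(f"inconsistent: {inconsistent:.4f}")
```

In practice the four log-probabilities would come from the model under test, with one or both positions masked as in Figure 1. Working in log space keeps the comparison additive and avoids underflow for low-probability tokens.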