Do Transformers Parse while Predicting the Masked Word?

Haoyu Zhao; Abhishek Panigrahi; Rong Ge; Sanjeev Arora

TL;DR

The paper investigates whether Transformer-based masked language models (MLMs) actually perform parsing or only compute something weakly correlated with it. Using PCFGs as a controlled setting, it shows that transformers of moderate size can approximately execute the Inside-Outside algorithm, and that the Inside-Outside algorithm is optimal for the MLM loss on PCFG-generated data. It gives transformer constructions that realize Inside-Outside at realistic sizes and shows, by probing models pre-trained on PCFG-generated data, that their embeddings allow recovery of both approximate parse trees and the marginal span probabilities computed by Inside-Outside. These findings suggest an implicit bias of masked language modeling toward this algorithm, which helps explain why syntactic structure emerges in MLMs.

Abstract

Pre-trained language models have been shown to encode linguistic structures, e.g. dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Some doubts have been raised as to whether the models are actually doing parsing or only some computation weakly correlated with it. We study two questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of doing parsing, or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al, 1993]. We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on the PCFG-generated data. We also give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$-dimensional embeddings on average such that its embeddings can be used to do constituency parsing with $>70\%$ F1 score on the PTB dataset. We conduct probing experiments on models pre-trained on PCFG-generated data to show that their embeddings not only allow recovery of an approximate parse tree, but also of the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling towards this algorithm.
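As a point of reference for the marginal span probabilities mentioned above, the following is a minimal sketch of the textbook Inside-Outside recursion on a PCFG in Chomsky normal form. The toy grammar, the names `BINARY`, `LEXICAL`, `ROOT`, and the functions `inside_outside` and `span_marginal` are all hypothetical illustrations; this is not the paper's grammar, indexing convention, or transformer construction.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (an assumption for
# illustration only; the paper uses an English PCFG).
BINARY = {  # (A, B, C): probability of rule A -> B C
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
LEXICAL = {  # (A, word): probability of rule A -> word
    ("NP", "she"): 0.5,
    ("NP", "fish"): 0.5,
    ("V", "eats"): 1.0,
}
ROOT = "S"


def inside_outside(words):
    """Return inside and outside tables keyed by (nonterminal, i, j), j inclusive."""
    n = len(words)
    inside = defaultdict(float)
    outside = defaultdict(float)

    # Inside pass, base case: spans of length 1 come from lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in LEXICAL.items():
            if word == w:
                inside[(a, i, i)] += p

    # Inside pass, recursion: build longer spans from two adjacent sub-spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (a, b, c), p in BINARY.items():
                for k in range(i, j):
                    inside[(a, i, j)] += p * inside[(b, i, k)] * inside[(c, k + 1, j)]

    # Outside pass: start at the root span and push mass down to children,
    # visiting parent spans from longest to shortest.
    outside[(ROOT, 0, n - 1)] = 1.0
    for length in range(n, 1, -1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (a, b, c), p in BINARY.items():
                out_a = outside[(a, i, j)]
                if out_a == 0.0:
                    continue
                for k in range(i, j):
                    outside[(b, i, k)] += p * out_a * inside[(c, k + 1, j)]
                    outside[(c, k + 1, j)] += p * out_a * inside[(b, i, k)]
    return inside, outside


def span_marginal(words, i, j):
    """Posterior probability that words[i..j] is a constituent of the parse."""
    inside, outside = inside_outside(words)
    z = inside[(ROOT, 0, len(words) - 1)]  # probability of the whole sentence
    nonterminals = {rule[0] for rule in BINARY} | {rule[0] for rule in LEXICAL}
    return sum(inside[(a, i, j)] * outside[(a, i, j)] for a in nonterminals) / z
```

With this toy grammar, `span_marginal(["she", "eats", "fish"], 1, 2)` evaluates to `1.0` because "eats fish" is a constituent in the only possible parse, while the span `(0, 1)` gets marginal `0.0`.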

Paper Structure (99 sections, 5 theorems, 35 equations, 7 figures, 8 tables)

Introduction
This paper
Preliminaries
Attention
PCFG and parsing
PCFG model
Data generation from PCFG
Parse trees and parsing
Probing
Parsing using Transformers
Transformers can execute Inside-Outside algorithm
Masked language modeling for PCFG
Towards realistic size
First ingredient: finding important non-terminals
Second ingredient: utilizing structures across non-terminals
...and 84 more sections

Key Result

Theorem 3.1

There exists a model with hard attention modules (eq:hard_attention), $(4|{\mathcal{N}}| + 1)L$ embeddings, $2L-1$ layers, and $4|{\mathcal{N}}|$ attention heads in each layer that simulates the Inside-Outside algorithm on all sentences with length at most $L$ generated by PCFG ${\mathcal{G}}$ …
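To get a feel for how the size of this construction scales, the sketch below evaluates the three expressions in the theorem statement for a given number of non-terminals $|\mathcal{N}|$ and maximum sentence length $L$. The function name `hard_attention_construction_size` is hypothetical, and reading "$(4|\mathcal{N}| + 1)L$ embeddings" as the embedding dimension is an assumption based on the abstract's wording.

```python
def hard_attention_construction_size(num_nonterminals: int, max_len: int) -> dict:
    """Evaluate the size expressions from the Theorem 3.1 statement.

    num_nonterminals plays the role of |N| and max_len the role of L.
    Treating the first quantity as an embedding dimension is an assumption.
    """
    return {
        "embedding_dim": (4 * num_nonterminals + 1) * max_len,
        "layers": 2 * max_len - 1,
        "heads_per_layer": 4 * num_nonterminals,
    }


# Illustrative values only (not taken from the paper): |N| = 3, L = 4.
print(hard_attention_construction_size(3, 4))
```

All three quantities grow linearly, the embedding size in both $|\mathcal{N}|$ and $L$, which is consistent with the outline's separate "Towards realistic size" section on reducing these requirements.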

Figures (7)

Figure 1: Comparison between different probes (linear or a 2-layer neural net) under different settings. 2-layer probes achieve better parsing performance than linear probes. The large performance gap between probes on layer 0's embeddings from A12L12 and probes on the best layer's embeddings shows the existence of meaningful syntactic information in the contextualized embeddings.
Figure 2: Comparison between different probes for marginal probabilities on the A12L12 model. The y-axis denotes correlation between the prediction and the target, and the x-axis denotes probes for different lengths.
Figure 3: Sentence F1 for linear probes $f(\cdot)$ trained on different layers' embeddings for different pre-trained models. We show the results under PCFG and PTB settings. A$i$L$j$ denotes the pre-trained model with $i$ attention heads and $j$ layers.
Figure 4: Comparison of the probes with different inputs under different settings. We probe the model with 12 attention heads and 12 layers, and report the scores with $f(\cdot)$ taking embeddings from layer 0 or the embeddings from the best layer. EOS denotes the probe that takes $[{\bm{e}}^{(\ell)}_i; {\bm{e}}^{(\ell)}_{\text{EOS}}]$ as input and predicts the relative depth $\text{tar}(i)$, and ADJ (Adjacent embeddings) denotes the probe that takes $[{\bm{e}}^{(\ell)}_{i-1}; {\bm{e}}^{(\ell)}_i; {\bm{e}}^{(\ell)}_{i+1}]$ as input.
Figure 5: Predicted probability versus true normalized marginal probability for different span lengths $\ell$, using a 2-layer NN probe on the 12-th layer's representations from the A12L12 model. In each plot, we sample 200 points (each point corresponds to a span) from the test set. The y-axis denotes the predicted probabilities and the x-axis denotes the true normalized marginal probabilities. The line shows the best linear fit for all the spans in the test set. Many points have very small normalized marginal probabilities, and their ranks are very hard to predict correctly, which results in a low Spearman correlation.
...and 2 more figures
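The EOS and ADJ probe inputs described in the Figure 4 caption can be sketched as simple concatenations of per-token embeddings. The helper names `eos_input` and `adj_input` are hypothetical, and two details are assumptions not stated in the caption: the EOS embedding is taken to be the last vector in the sequence, and missing neighbours at the sentence boundary are padded with zeros.

```python
def eos_input(embs, i):
    """Concatenate token i's embedding with the EOS embedding: [e_i; e_EOS].

    Assumes the EOS token is the last position in embs.
    """
    return embs[i] + embs[-1]


def adj_input(embs, i):
    """Concatenate the embeddings of tokens i-1, i, i+1: [e_{i-1}; e_i; e_{i+1}].

    Zero-padding at the boundaries is an assumption made for this sketch.
    """
    zero = [0.0] * len(embs[0])
    left = embs[i - 1] if i > 0 else zero
    right = embs[i + 1] if i + 1 < len(embs) else zero
    return left + embs[i] + right


# Toy 2-dimensional embeddings for a 3-token sequence (last token is EOS).
embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(eos_input(embs, 0))
print(adj_input(embs, 0))
```

A probe $f(\cdot)$, linear or a 2-layer network, would then be trained on these concatenated vectors to predict the relative depth $\text{tar}(i)$.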

Theorems & Definitions (8)

Theorem 3.1: Hard attention
proof : Proof sketch
Theorem 3.2: Relative positional embeddings
Theorem 3.3
Theorem 3.4: Informal
proof
proof : Proof of the theorem labeled thm:io-optimal-mlm
Theorem C.2: Informal
