Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs

Angelica Chen; Ravid Shwartz-Ziv; Kyunghyun Cho; Matthew L. Leavitt; Naomi Saphra

TL;DR

This paper examines how grammatical capabilities emerge during masked language model (MLM) pretraining by tracking Syntactic Attention Structure (SAS). It identifies a brief structure onset, where SAS spikes in tandem with a steep drop in loss, followed by a capabilities onset, where grammatical performance improves, suggesting a causal link from internal syntactic representations to external linguistic competence. Using a syntactic regularizer, the authors show that SAS is necessary for complex grammar but competes with an alternative strategy, and that briefly suppressing SAS early in training can accelerate learning, whereas mis-timed suppression may hinder long-term performance. The study frames these dynamics in terms of phase transitions and simplicity bias, offering causal evidence from training interventions and highlighting implications for optimization, curriculum design, and interpretability in neural NLP models.

Abstract

Most interpretability research in NLP focuses on understanding the behavior and features of a fully trained model. However, certain insights into model behavior may only be accessible by observing the trajectory of the training process. We present a case study of syntax acquisition in masked language models (MLMs) that demonstrates how analyzing the evolution of interpretable artifacts throughout training deepens our understanding of emergent behavior. In particular, we study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We identify a brief window in pretraining when models abruptly acquire SAS, concurrent with a steep drop in loss. This breakthrough precipitates the subsequent acquisition of linguistic capabilities. We then examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities. We further find that SAS competes with other beneficial traits during training, and that briefly suppressing SAS improves model quality. These findings offer an interpretation of a real-world example of both simplicity bias and breakthrough training dynamics.
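To make the idea of attention heads that "focus on specific syntactic relations" concrete, the following is a minimal sketch of scoring a single attention head against a gold dependency parse. It assumes a simplified setup that is not taken from the paper: `attn[i][j]` is the attention weight from token `i` to token `j`, `gold_heads[i]` is the index of token `i`'s syntactic parent (`-1` for the root), and a token's predicted parent is simply the other token it attends to most. The function name `attention_uas` and the toy data are hypothetical; the paper's own SAS metric is defined in its Methods section and may differ, for example in how heads are selected per relation.

```python
def attention_uas(attn, gold_heads):
    """Fraction of non-root tokens whose most-attended token is their gold parent.

    attn[i][j]    -- attention weight from token i to token j (one head).
    gold_heads[i] -- index of token i's syntactic parent, or -1 for the root.
    """
    correct = 0
    total = 0
    for i, gold in enumerate(gold_heads):
        if gold < 0:
            continue  # the root has no parent to predict
        # Exclude self-attention when choosing the predicted parent.
        candidates = [j for j in range(len(attn[i])) if j != i]
        predicted = max(candidates, key=lambda j: attn[i][j])
        correct += predicted == gold
        total += 1
    return correct / total if total else 0.0


# Toy three-token sentence "the cat sleeps" (illustrative values only).
toy_attn = [
    [0.1, 0.8, 0.1],  # "the" attends mostly to "cat"    (gold parent: cat)
    [0.6, 0.1, 0.3],  # "cat" attends mostly to "the"    (gold parent: sleeps)
    [0.2, 0.7, 0.1],  # "sleeps" is the root and is skipped
]
toy_heads = [1, 2, -1]
score = attention_uas(toy_attn, toy_heads)
print(score)
```

In this toy case one of the two non-root tokens attends most strongly to its gold parent, so the score is 0.5. Tracking a score of this kind across training checkpoints is the sort of measurement that would reveal an abrupt onset of syntactic attention structure.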

Paper Structure (52 sections, 9 equations, 20 figures, 2 tables)

Introduction
Methods
Syntactic Attention Structure
Controlling SAS
Identifying breakthroughs
Models and Data
Architecture and Training
Finetuning and probing
Fine-tuning on GLUE
Evaluating on BLiMP
Evaluating SAS dependency parsing
Results
The Syntax Acquisition Phase
Complexity and Compression
Controlling SAS
...and 37 more sections

Figures (20)

Figure 1: BERT first learns to focus on syntactic neighbors with specialized attention heads, and then exhibits grammatical capabilities in its MLM objective. The former (internal) and the latter (external) model behaviors both emerge abruptly, at moments we respectively call the structure onset ($\blacktriangle$) and capabilities onset ($\newmoon$) (quantified as described in \ref{sec:inflections}). We separately visualize three runs with different seeds, noting that these seeds differ in the stability of Unlabeled Attachment Score (UAS; see \ref{sec:SAS_metric}) after the structure onset, but uniformly show that SAS emerges almost entirely in a brief window of time. We show (a) MLM loss, with 95% confidence intervals across samples by nonparametric bootstrapping; (b) internal grammar structure, measured by UAS on the parse induced by the attention distributions; and (c) external grammar capabilities, measured by average BLiMP accuracy with 95% confidence intervals across tasks by nonparametric bootstrapping.
Figure 2: Metrics during $\textrm{BERT}_\textrm{Base}$ training averaged, with 95% confidence intervals, across three seeds. Structure ($\blacktriangle$) and capabilities ($\newmoon$) onsets are marked.
Figure 3: Metrics over the course of training for baseline and SAS-regularized models (under both suppression and promotion of SAS). Structure ($\blacktriangle$) and capabilities ($\newmoon$) onsets are marked, except on $\textrm{BERT}_\textrm{SAS-}$, which does not clearly exhibit either onset. Each line is averaged over three random seeds. On y-axis: (a) MLM loss; (b) implicit parse accuracy (UAS); (c) average BLiMP accuracy over all phenomena categories. Shaded regions represent the 95% confidence interval.
Figure 4: Metrics for the checkpoint at 100K steps, for various models with SAS suppressed early in training. The vertical line marks the $\textrm{BERT}_\textrm{SAS-}$ alternative strategy onset; note that model quality is worst when the regularizer is changed during this phase transition. The x-axis reflects the timestep when the regularizer $\lambda$ is changed from $0.001$ to $0$. To control for the length of training time without suppressing SAS, \ref{sec:50k} presents the same findings measured at a checkpoint exactly 50K timesteps after releasing the regularizer. On y-axis: (a) MLM loss; (b) implicit parse accuracy (UAS); (c) GLUE average (task breakdown in \ref{sec:glue_task}); (d) BLiMP average (task breakdown in \ref{sec:blimp_task}). Shaded regions represent 95% confidence intervals across three seeds.
Figure 5: If SAS is suppressed only briefly, it accelerates and augments the SAS onset. However, further suppression delays and attenuates the spike in UAS, until it eventually ceases to show a clear inflection. A vertical dotted line marks the $\textrm{BERT}_\textrm{SAS-}$ alternative strategy onset and the shaded region indicates the 95% confidence interval across three seeds.
...and 15 more figures
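Several captions above report 95% confidence intervals obtained by nonparametric bootstrapping. As a rough illustration of that technique, the sketch below computes a percentile bootstrap interval for a mean using only the Python standard library. The function name `bootstrap_ci`, the number of resamples, the fixed seed, and the accuracy values are all assumptions made for this example; they are not taken from the paper, which may use a different bootstrap variant.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical per-task accuracies, for illustration only (not results from the paper).
task_accuracies = [0.62, 0.71, 0.58, 0.80, 0.67, 0.74]
low, high = bootstrap_ci(task_accuracies)
print(f"mean={mean(task_accuracies):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Resampling across tasks (as in the BLiMP panels) or across samples (as in the loss panels) only changes what `values` contains; the procedure itself is the same.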
