Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?

Michael Li, Nishant Subramani

TL;DR

This study extends BERTology to 25 transformer models and eight English linguistic tasks to test whether modern LMs retain the classical NLP pipeline. It finds that syntax is captured in early layers, semantics in middle layers, and discourse in later layers across architectures. Larger models tend to move these information peaks earlier in depth, indicating faster consolidation of linguistic knowledge, while still preserving hierarchical organization. A focused multilingual analysis of lexical identity (lemma) and inflectional morphology reveals that lemma information is predominantly linear in early layers and becomes nonlinear deeper in the network, whereas inflectional morphology remains linearly accessible across layers and languages; steering experiments show that inflection representations can be functionally manipulated. The results suggest robust, architecture-agnostic regularities in how pretraining and model capacity shape the internal structuring of linguistic information, with implications for the interpretability and controllability of LMs.

Abstract

Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, we find that hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We then conduct an in-depth multilingual analysis of two linguistic properties, lexical identity and inflectional morphology, that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next-token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing
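
To make the layer-by-layer probing idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a nearest-centroid classifier (a simple linear classifier) as a stand-in for the paper's probes, and it replaces real hidden states with synthetic 2-d vectors. The names `toy_layer`, `probe_accuracy`, and `layerwise_accuracy` are hypothetical, as is the `signal` parameter that controls how strongly a toy layer encodes the label.

```python
import random


def fit_centroids(xs, ys):
    """Fit a nearest-centroid probe: one mean activation vector per label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        total = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            total[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in total] for y, total in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is nearest to x (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=dist)


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit the probe on frozen training activations and score it on held-out ones."""
    centroids = fit_centroids(train_x, train_y)
    correct = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def toy_layer(signal, n, rng):
    """Hypothetical stand-in for one layer's activations: 2-d vectors whose
    first coordinate carries a binary label with strength `signal`."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * signal + rng.gauss(0, 1), rng.gauss(0, 1)])
        ys.append(y)
    return xs, ys


def layerwise_accuracy(signals, n=200, seed=0):
    """Probe each toy layer independently and return one accuracy per layer."""
    rng = random.Random(seed)
    accuracies = []
    for signal in signals:
        train_x, train_y = toy_layer(signal, n, rng)
        test_x, test_y = toy_layer(signal, n, rng)
        accuracies.append(probe_accuracy(train_x, train_y, test_x, test_y))
    return accuracies


if __name__ == "__main__":
    # Three toy layers with no, weak, and strong label signal.
    for layer, acc in enumerate(layerwise_accuracy([0.0, 1.0, 3.0])):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In a real experiment the toy vectors would be replaced by hidden states extracted from each model layer, and the probe would be trained separately per layer and per task, so that the resulting accuracy curve shows where a property becomes decodable.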

Paper Structure

This paper contains 68 sections, 9 equations, 25 figures, 26 tables.

Figures (25)

  • Figure 1: Overview of our classifier methodology. We extract hidden state activations from each model layer for target words and train classifiers for token, span and pairwise edge predictions (POS, dependencies, constituents, NER, SRL, SPR, coreference, and relations), as well as word-level lemma and inflection prediction. We compare linear regression, MLP, and random-forest classifiers, compute selectivity using control labels, and summarize where performance emerges with expected layer and center of gravity.
  • Figure 2: Left: heatmaps of probe accuracy by layer for three models (BERT-Base, Qwen2.5-1.5B, OLMo-2-1124-7B-Instruct). The top row shows MLP probe accuracy, the bottom row shows linear probe accuracy. Right: Pearson correlations between models are computed by vectorizing each model's per-layer, per-task accuracy grid. The lower triangular matrix shows MLP probe accuracy correlations, the upper triangular matrix shows linear probe accuracy correlations.
  • Figure 3: Expected Layer (blue) and Center of Gravity (purple) for the same three models. $\tau_{\text{lin}}$ and $\tau_{\text{MLP}}$ give selectivity (real vs. control accuracy) for linear and MLP probes. Higher $\tau$ indicates probes extract true signal rather than memorizing.
  • Figure 4: Linguistic accuracy and classifier selectivity across model layers for English. The first two columns show lemma (top) and inflection (bottom) prediction accuracy using Linear Regression (left) and MLP (right) classifiers. The next two columns show classifier selectivity (difference between linguistic and control task accuracy) for the same tasks and classifiers. Higher selectivity indicates better generalization rather than memorization. Each line represents a different model. Multilingual results are shown in Figure 5.
  • Figure 5: Cross-linguistic patterns in linguistic accuracy and classifier selectivity. The first two columns show lemma (top) and inflection (bottom) prediction accuracy using Linear Regression (left) and MLP (right) classifiers. The third column shows inflection prediction with Random Forest classifiers (lemma prediction is computationally infeasible due to the large number of classes). The rightmost columns show classifier selectivity for lemma (top) and inflection (bottom) tasks. Each line represents a different model-language combination.
  • ...and 20 more figures
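
The summary statistics named in the captions above (selectivity, expected layer, center of gravity) can be sketched as follows. The paper's own equations are not reproduced in this excerpt, so this sketch assumes the definitions from the earlier BERT pipeline work by Tenney et al. (2019): expected layer weights each layer index by the gain in probe score over the previous layer, and center of gravity is a weighted mean of layer indices. Selectivity follows the caption wording (real-task accuracy minus control-task accuracy). The function names and the example numbers are hypothetical.

```python
def selectivity(real_accuracy, control_accuracy):
    """Selectivity: accuracy on the real task minus accuracy on the control task."""
    return real_accuracy - control_accuracy


def expected_layer(scores):
    """Expected layer from a sequence of probe scores indexed by layer.

    Assumption: scores[0] is a baseline (for example, the embedding layer) and
    each layer l >= 1 is weighted by its gain scores[l] - scores[l - 1].
    """
    gains = [scores[l] - scores[l - 1] for l in range(1, len(scores))]
    total = sum(gains)
    if total == 0:
        raise ValueError("scores do not change across layers; expected layer is undefined")
    return sum(l * g for l, g in enumerate(gains, start=1)) / total


def center_of_gravity(weights):
    """Weighted mean layer index under non-negative per-layer weights.

    Assumption: the weights are whatever per-layer importance the analysis
    assigns (for example, learned layer-mixing weights); they are normalized here.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(l * w for l, w in enumerate(weights)) / total


if __name__ == "__main__":
    toy_scores = [0.50, 0.70, 0.80, 0.82]  # made-up scores, not results from the paper
    print("expected layer:", expected_layer(toy_scores))
    print("center of gravity:", center_of_gravity([1.0, 2.0, 1.0]))
    print("selectivity:", selectivity(0.9, 0.6))
```

With the made-up scores above, most of the gain arrives at layers 1 and 2, so the expected layer comes out at about 1.44. A lower value means the property becomes decodable earlier in depth, which is the sense in which information peaks can be said to move earlier in larger models.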