Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?

Michael Li; Nishant Subramani

TL;DR

This study extends BERTology to 25 transformer models, probing eight English linguistic tasks alongside a focused multilingual analysis, to test whether modern LMs retain the classical NLP pipeline. It finds that syntax is captured in early layers, semantics in middle layers, and discourse in later layers across architectures. Larger models tend to shift these information peaks to earlier layers, which suggests faster consolidation of linguistic knowledge, while still preserving the hierarchical organization. The multilingual analysis of lexical identity (lemma) and inflectional morphology shows that lemma information is predominantly linearly decodable in early layers and becomes increasingly nonlinear deeper in the network, whereas inflectional morphology remains linearly accessible across layers and languages; steering experiments show that inflection representations can be functionally manipulated. The results suggest robust, architecture-agnostic regularities in how pretraining and model capacity shape the internal organization of linguistic information, with implications for the interpretability and controllability of LMs.

Abstract

Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models ranging from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We then conduct an in-depth multilingual analysis of two linguistic properties, lexical identity and inflectional morphology, that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next-token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing
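
To make the steering idea mentioned above concrete, here is a minimal sketch of one common way to build and apply a steering vector. The abstract does not specify the paper's procedure, so the difference-of-means construction is an assumption. The "activations" are synthetic stand-ins rather than real model hidden states, and every name (`steering_vector`, `steer`, `nearest_class`, the singular/plural labels) is hypothetical.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(source_acts, target_acts):
    """Difference of class means (assumed construction, not the paper's)."""
    src = mean_vector(source_acts)
    tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(src, tgt)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def nearest_class(hidden, centroids):
    """Toy read-out: return the label of the closest class centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(hidden, centroids[label]))


# Synthetic stand-ins for layer activations of two inflection classes.
rng = random.Random(0)
DIM = 8


def sample(center, n):
    """Draw n noisy points around a class center."""
    return [[c + rng.gauss(0, 0.1) for c in center] for _ in range(n)]


singular = sample([0.0] * DIM, 50)
plural = sample([1.0] + [0.0] * (DIM - 1), 50)

direction = steering_vector(singular, plural)
centroids = {"singular": mean_vector(singular), "plural": mean_vector(plural)}

example = singular[0]
before = nearest_class(example, centroids)
after = nearest_class(steer(example, direction), centroids)
```

In this toy setup the two classes differ along a single coordinate, so adding the mean-difference vector moves a "singular" activation into the "plural" cluster. With a real model, the activations would be collected from a chosen layer, and the effect would be judged on the model's outputs rather than on a centroid read-out.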
