What We Talk About When We Talk About LMs: Implicit Paradigm Shifts and the Ship of Language Models

Shengqi Zhu; Jeffrey M. Rzeszotarski

TL;DR

The paper investigates how the term Language Models (LMs) functions as a time-variant referent, akin to a Ship of Theseus, across NLP literature. It builds a data infrastructure from 7,650 papers in ACL, EMNLP, and NAACL (2020–2023), introducing two keyword sets: $\boldsymbol{\mathcal{L}}$ for collective LM mentions and $\boldsymbol{\mathcal{M}}$ for specific models, with model names automatically detected via a GPT-4-turbo workflow and manually validated to yield 103 models and 155 aliases. Through quantitative analyses of $N^{\mathcal{L}}$, $N_m$, and related metrics, the study reveals a dramatic rise in LM discourse post-2021, a shift in referents from older archetypes (e.g., BERT) toward newer generative models (e.g., GPT-family, LLaMA, T5 variants), and a decreasing cross-conference similarity in model compositions. The findings highlight how a stable, high-level term can mask substantial changes in its concrete referents, underscoring the need for fine-grained, diachronic analyses to interpret scientific progress and guide long-term knowledge management in fast-evolving fields.
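The counting behind $N^{\mathcal{L}}$ and $N_m$ can be illustrated with a minimal sketch. It assumes that $N^{\mathcal{L}}$ is the number of occurrences of keywords from $\mathcal{L}$ in one paper and that $N_m$ is the number of occurrences of any alias of model $m$. The keyword list, the alias table, and the matching rules below are hypothetical stand-ins, not the paper's actual dictionaries or implementation.

```python
import re
from collections import Counter

# Hypothetical stand-in for the keyword set L (collective LM mentions).
L_KEYWORDS = ["language model", "language models", "LM", "LMs", "LLM", "LLMs"]

# Hypothetical stand-in for M: each alias maps to one canonical model name.
M_ALIASES = {
    "BERT": "BERT",
    "GPT-3": "GPT-3",
    "GPT3": "GPT-3",
    "LLaMA": "LLaMA",
    "Llama": "LLaMA",
}


def build_pattern(terms, flags=0):
    """Compile one regex that matches any term as a standalone token."""
    # Longest terms first, so "language models" wins over "language model".
    alternatives = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in alternatives)
    # Reject matches inside longer names such as "RoBERTa" or "GPT-3.5".
    return re.compile(r"(?<![\w-])(?:" + body + r")(?!\w|\.\d)", flags)


L_PATTERN = build_pattern(L_KEYWORDS, re.IGNORECASE)
M_PATTERN = build_pattern(M_ALIASES)  # model names are matched case-sensitively


def count_collective(text):
    """Assumed N^L: number of collective LM mentions in one paper's text."""
    return sum(1 for _ in L_PATTERN.finditer(text))


def count_models(text):
    """Assumed N_m for every model m: mentions grouped by canonical name."""
    return Counter(M_ALIASES[m.group(0)] for m in M_PATTERN.finditer(text))


if __name__ == "__main__":
    sample = ("Large language models (LLMs) such as GPT-3 and Llama "
              "differ from BERT. RoBERTa is another LM.")
    print(count_collective(sample))
    print(count_models(sample))
```

Mapping aliases to a canonical name before counting keeps spelling variants (here "Llama" and "LLaMA") from being treated as separate models, which is the role the alias list plays in the description above.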

Abstract

The term Language Models (LMs) as a time-specific collection of models of interest is constantly reinvented, with its referents updated much like the $\textit{Ship of Theseus}$ replaces its parts but remains the same ship in essence. In this paper, we investigate this $\textit{Ship of Language Models}$ problem, wherein scientific evolution takes the form of continuous, implicit retrofits of key existing terms. We seek to initiate a novel perspective of scientific progress, in addition to the more well-studied emergence of new terms. To this end, we construct the data infrastructure based on recent NLP publications. Then, we perform a series of text-based analyses toward a detailed, quantitative understanding of the use of Language Models as a term of art. Our work highlights how systems and theories influence each other in scientific discourse, and we call for attention to the transformation of this Ship that we all are contributing to.

Paper Structure (27 sections, 3 equations, 13 figures, 1 table)

Introduction
Related Work
Diachronic Analysis of the Progress in NLP
Paradigm Shifts and Scientific Trends
Methods
Dataset Construction
Default Setup
Retrieving the Mentions of LMs
Notations
Constructing M from the text
Experiments and Findings
Wind in the Sails: Surging Mentions, Speeding Conclusions
What about the actual models we use?
Oak, Pine, or Cedar Planks: Which models are we talking about?
One dominant model or many contributors?
...and 12 more sections

Figures (13)

Figure 1: The full pipeline for constructing the model dictionaries (see "Constructing M from the text"). The LLM agent follows a formatted prompt (given in the appendix) to automatically identify potential model names. The extracted strings are merged and ranked by frequency to form the list of candidate names. Then, the authors manually validate whether each candidate is a new entry, an alias, or other based on a fixed protocol: (1) whether the candidate is indeed a valid model name, (2) whether it refers to the same model as an existing entry, and (3) whether it is already covered by an alias of that entry.
Figure 2: Increase of interest in LMs as a topic (a) and as a term in use (b). (a): The proportion of papers containing keywords in $\mathcal{L}$ by years. (b): The estimated value of $\bar{N^\mathcal{L}}$ based on the proportions (dashed line) compared with the actual $\bar{N^\mathcal{L}}$ (solid line).
Figure 3: Pairwise comparisons of the distributions of $N^\mathcal{L}$ (upper) and $N$ (lower). Each grid corresponds to a pair of conferences indexed by its row and column, and depicts results from two analyses: a K-S test of whether the data of the pair are from different distributions (heatmap), and mean difference (digits and hue).
Figure 4: Compositions of model names in the main conference papers at EMNLP 2020 (left) and 2023 (right), arranged counterclockwise by their size. Branches indicate dependency, e.g., GPT-3.5 is based on and fine-tuned from GPT-3. Each group of models with the same root is represented by a unique color shared between graphs.
Figure 5: Jaccard similarity of model compositions between all pairs of conferences.
...and 8 more figures
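
The pairwise comparison in Figure 5 can be sketched as follows. This is a minimal illustration that assumes each conference's model composition is treated as a plain set of canonical model names; the paper may define the composition differently (for example, weighted by mention counts). The conference contents below are hypothetical, not data from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of names."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty compositions are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(compositions):
    """Return {(conf_x, conf_y): similarity} for every unordered pair."""
    return {
        (x, y): jaccard(compositions[x], compositions[y])
        for x, y in combinations(sorted(compositions), 2)
    }


if __name__ == "__main__":
    # Hypothetical model compositions, for illustration only.
    compositions = {
        "EMNLP 2020": {"BERT", "RoBERTa", "GPT-2", "XLNet"},
        "EMNLP 2023": {"BERT", "RoBERTa", "GPT-4", "LLaMA", "T5"},
    }
    for pair, score in pairwise_jaccard(compositions).items():
        print(pair, round(score, 3))
```

A lower score between two conferences means they share fewer model names relative to everything either one mentions, which is how a decreasing cross-conference similarity would show up in such a matrix.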
