Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers

Rajiv Movva; Sidhika Balachandar; Kenny Peng; Gabriel Agostini; Nikhil Garg; Emma Pierson

TL;DR

This paper analyzes 16,979 LLM-related arXiv papers from 2018–2023 to map shifts in topics, authors, and institutions using bibliometric methods. It shows growth in societal-impact topics and cross-domain applications, an influx of new authors from non-NLP fields, and a decline in Big Tech publishing alongside rising Asian universities. Top-cited papers are split between industry and academia, while cross-country collaboration remains limited, with Microsoft as a notable bridge. These findings inform onboarding, openness, and policy discussions, highlighting the need for open data, interdisciplinary collaboration, and balanced industry–academic ecosystems.

Abstract

Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field's future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20x growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors -- half of all first authors in 2023 -- are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration.

Paper Structure (33 sections, 11 figures, 6 tables)

Introduction
Methods
Results
Which topics and authors are driving the growth of LLM research?
How have topics shifted in 2023?
LLM papers increasingly involve societal impacts and fields beyond NLP.
The fastest-growing LLM topics cover applications, capabilities, and methods.
Shrinking topics highlight centralization around closed-source models.
Who are the authors driving the expansion of LLM research?
In 2023, nearly half of LLM first authors have not previously published on NLP.
What fields are new authors coming from?
New authors are driving the increased disciplinary diversity of LLM research.
What are the roles of industry & academia?
A large (and growing) majority of LLM papers are published by academic institutions.
Big Tech companies are publishing less, and universities in Asia are publishing more.
...and 18 more sections

Figures (11)

Figure 1: Sub-arXivs growing fastest in fraction of LLM papers. Top: Proportions of LLM-related papers in a sub-arXiv in 2023 (blue) and pre-2023 (red). Bottom: sub-arXivs are sorted by the ratio of these two quantities, representing how much more likely, in 2023, a random paper in the sub-arXiv involves language models. The $x$-axis labels correspond to, respectively: Computers and Society, Robotics, Human-Computer Interaction, Cryptography and Security, Artificial Intelligence, Software Engineering, Computer Vision, Social and Info. Networks, Machine Learning, Sound, Information Retrieval, and Computation and Language.
Figure 2: Mean citation percentile vs. number of LLM papers for the 34 institutions which are in the top 20 of either metric. Some point labels were removed for visual clarity. Citation percentile is defined in the Methods section.
Figure 3: Topics which occur most disproportionately among industry (blue) vs. academic (red) papers. Left: Topics are sorted by the ratio $p(\text{topic} \mid \text{industry-only}) / p(\text{topic} \mid \text{academic-only})$, excluding industry-academic collaboration papers and papers with no inferred affiliations. Right: Topic frequencies by group.
Figure 4: Collaborations between the 20 institutions with the most LLM papers. Node area is proportional to number of papers and edge width to number of collaborations between nodes (we show only edges corresponding to $\ge 5$ collaborations). Microsoft collaborates with academic institutions across the U.S. and China (UIUC also has exactly 5 papers with Tsinghua). There are several notable academic collaborations and industry-academic collaborations, especially involving Microsoft, Google, UW, CMU, Stanford, Tsinghua, and Peking.
Figure S1: The overall incidence of LLM-related papers has increased substantially in the last few years, up to 12% of all arXiv CS/Stat submissions since the second quarter of 2023. Papers are identified as LLM-related if their title or abstract contains one of the keywords listed in the paper's definition of LLM-related papers.
...and 6 more figures
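
The ratio described in the Figure 1 caption can be illustrated with a short sketch. This is not the authors' code: the record layout `(sub_arxiv, year, is_llm)`, the function name `llm_fraction_ratio`, and the toy data in the usage note are assumptions made for illustration only. For each sub-arXiv, the sketch computes the fraction of papers that are LLM-related in 2023 and before 2023, then sorts sub-arXivs by the ratio of the two fractions.

```python
from collections import Counter


def llm_fraction_ratio(papers, split_year=2023):
    """Rank sub-arXivs by growth in their fraction of LLM-related papers.

    papers: iterable of (sub_arxiv, year, is_llm) tuples. This layout is a
    hypothetical one chosen for the sketch, not the paper's data format.
    Returns (sub_arxiv, frac_recent, frac_before, ratio) tuples sorted by
    ratio, largest first.
    """
    totals = Counter()  # all papers per (sub_arxiv, period)
    llm = Counter()     # LLM-related papers per (sub_arxiv, period)
    for sub, year, is_llm in papers:
        period = "recent" if year >= split_year else "before"
        totals[(sub, period)] += 1
        if is_llm:
            llm[(sub, period)] += 1

    rows = []
    for sub in sorted({s for s, _ in totals}):
        n_recent = totals[(sub, "recent")]
        n_before = totals[(sub, "before")]
        # Skip sub-arXivs where the ratio is undefined (no papers in one
        # period, or no LLM-related papers before the split year).
        if n_recent == 0 or n_before == 0 or llm[(sub, "before")] == 0:
            continue
        frac_recent = llm[(sub, "recent")] / n_recent
        frac_before = llm[(sub, "before")] / n_before
        rows.append((sub, frac_recent, frac_before, frac_recent / frac_before))

    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

As a usage example with invented counts: if a sub-arXiv has 1 LLM-related paper out of 2 in 2023 and 1 out of 4 before 2023, its fractions are 0.5 and 0.25 and its ratio is 2.0, so it sorts ahead of a sub-arXiv whose fraction stayed at 0.5 in both periods (ratio 1.0).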
