How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices?

Andres Algaba, Vincent Holst, Floriano Tori, Melika Mobini, Brecht Verbeken, Sylvia Wenmackers, Vincent Ginis

TL;DR

The paper shows that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. By reflecting and amplifying established trends, LLMs may reshape citation practices and influence the trajectory of scientific discovery.

Abstract

The spread of scientific knowledge depends on how researchers discover and cite previous work. The adoption of large language models (LLMs) in the scientific research process introduces a new layer to these citation practices. However, it remains unclear to what extent LLMs align with human citation practices, how they perform across domains, and how they may influence citation dynamics. Here, we show that LLMs systematically reinforce the Matthew effect in citations by consistently favoring highly cited papers when generating references. This pattern persists across scientific domains despite significant field-specific variations in existence rates, which refer to the proportion of generated references that match existing records in external bibliometric databases. Analyzing 274,951 references generated by GPT-4o for 10,000 papers, we find that LLM recommendations diverge from traditional citation patterns by preferring more recent references with shorter titles and fewer authors. The generated references are nevertheless relevant at the content level: they are semantically aligned with the content of each paper at levels comparable to the ground truth references and display similar network effects while reducing author self-citations. These findings illustrate how LLMs may reshape citation practices and influence the trajectory of scientific discovery by reflecting and amplifying established trends. As LLMs become more integrated into the scientific research process, it is important to understand their role in shaping how scientific communities discover and build upon prior work.
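The existence rate defined in the abstract is a simple proportion, and a minimal sketch may make it concrete. The function name `existence_rate_by_field` and the toy records are hypothetical illustrations, not the authors' code; the two counts used for the overall rate are the generated and existing generated reference counts reported in the Figure 3 caption.

```python
from collections import defaultdict


def existence_rate_by_field(references):
    """Return the share of generated references that matched a database record, per field.

    `references` is an iterable of (field, exists) pairs, where `exists` is True
    when the generated reference was found in the bibliometric database.
    """
    totals = defaultdict(int)
    matched = defaultdict(int)
    for field, exists in references:
        totals[field] += 1
        if exists:
            matched[field] += 1
    return {field: matched[field] / totals[field] for field in totals}


# Toy records (hypothetical, for illustration only).
sample = [
    ("biology", True),
    ("biology", False),
    ("history", True),
    ("history", True),
]
rates = existence_rate_by_field(sample)

# Overall rate from the counts in the Figure 3 caption:
# 116,939 existing generated references out of 274,497 generated references.
overall_rate = 116_939 / 274_497
```

The overall rate computed from those counts falls inside the 40% to 50% band that the Figure 3 caption reports across publication years.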

Paper Structure

This paper contains 3 sections, 8 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Overview of our experiment comparing the characteristics of human citations and LLM-generated references, when the model is tasked to suggest references based on the title, authors, year, venue, and abstract of a paper. We sample $10,000$ focal papers from all SciSciNet (Lin et al., 2023) papers that are published in Q1 journals between $1999$ and $2021$, have between 3 and 54 references, and have at least 1 citation (n=$17,538,900$). We prompt GPT-4o to suggest references based on the title, authors, year, venue, and abstract of a focal paper, where the number of requested references equals the number of ground truth references in the focal paper, which amounts to a total of $274,951$ references. We verify the existence of the generated references via the SciSciNet (Lin et al., 2023) database and compare the characteristics, such as title length, publication year, venue, number of authors, and semantic embeddings, of the existing and non-existent generated references with the ground truth. For the existing generated references, we also compare additional characteristics, such as the number of citations and references, and analyze the properties of their citation networks.
  • Figure 2: Descriptive statistics of the focal paper sample. This figure summarizes key characteristics of the focal paper sample (n=$10,000$) across fields. a, The distribution of focal papers by field highlights strong representation in the exact sciences (biology, chemistry, computer science, environmental science, engineering, geography, geology, materials science, mathematics, medicine, and physics), with comparatively fewer papers in the humanities (art, history, and philosophy) and the social sciences (business, economics, political science, psychology, and sociology) (Appendix Table \ref{tab:mapping}). b, The temporal trend in the number of focal papers exhibits linear growth from 1999 until 2021, which aligns with the full SciSciNet (Lin et al., 2023) database for this period. c,d, Both the median number of references cited per focal paper and the median team size increase over time. This pattern is clearer in the fields with a larger number of focal papers (e.g., biology, chemistry, and medicine). The color intensity represents the magnitude of the values: darker shades indicate higher values, while lighter shades indicate lower values. Hatched cells indicate that no data are available for a given year and field.
  • Figure 3: Existing generated references reinforce the Matthew effect in citations. This figure displays the existence rate of generated references (gray, n=$274,497$), and the citation characteristics of the ground truth (blue, n=$274,951$) and existing generated (orange, n=$116,939$) references across fields and time. Error bars and shaded bands represent $95\%$ confidence intervals. a, The existence rate of generated references by field of the focal paper shows significantly lower values in the exact sciences compared to the humanities and the social sciences. b, Median citation counts reveal that existing generated references tend to have higher citation counts across all fields, suggesting a preference toward already highly cited works. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms that the existing generated references have a statistically significantly higher median citation count for all fields (history, $p=0.003$; philosophy, $p=0.022$; all other fields, $p<0.001$). c, The median reference counts tend to be more similar for many fields, with only political science showing a lower median reference count for existing generated references. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level shows that the existing generated references have a statistically significantly higher median reference count for biology ($p<0.001$), chemistry ($p<0.001$), environmental science ($p<0.001$), geography ($p=0.007$), materials science ($p<0.001$), mathematics ($p<0.001$), medicine ($p<0.001$), psychology ($p<0.001$), and sociology ($p=0.002$). All other fields show no statistically significant difference ($p>0.05$). d, Temporal trends at the focal paper level show that existing generated references consistently exhibit higher median citation counts compared to ground truth references, further emphasizing the reinforcement of the Matthew effect in citations. e, The overall existence rate of generated references remains consistent across the publication year of the focal paper, fluctuating between $40\%$ and $50\%$.
  • Figure 4: Generated references exhibit a systematic preference for more recent references with shorter titles and fewer authors. This figure summarizes key characteristics for ground truth (blue, n=$274,951$), generated (green, n=$274,497$), existing generated (orange, n=$116,939$), and non-existent generated (red, n=$157,558$) references. a, The relative frequency of publication years within each reference group, with median publication years indicated by vertical lines, shows that generated references are generally more recent than the ground truth. This recency bias is driven by non-existent generated references, which disproportionately cite more recent publications. Existing generated references show a more complex pattern, tempering the overall recency bias in the generated references. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in median publication year between ground truth and generated references ($p<0.001$). b, The distribution of the number of authors shows that generated references tend to favor documents with fewer authors, with a peak around 2-3 authors (1-3 for existing generated references), compared to 2-6 authors for ground truth references. A small proportion of generated references are labeled as "et al." ($3\%$), with higher rates in non-existent ($4\%$) than existing generated references ($1.5\%$). The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in the median number of authors between ground truth and generated references ($p<0.001$). c, The distribution of the title length shows that generated references tend to favor documents with shorter titles. This effect is most pronounced for the existing generated references. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in the median title length between ground truth and generated references ($p<0.001$). d, The journal rankings show the top 10 journals across different reference groups. The size of each dot represents the relative frequency with which that journal appears within its reference group. Journals are connected by solid lines when appearing in all three groups' top 10, dotted lines when appearing in two groups' top 10, and shown in italic font when appearing in only one group's top 10, highlighting the distinct citation patterns across reference types.
  • Figure 5: Generated references exhibit a level of cosine similarity to focal paper titles and abstracts on par with ground truth references, surpassing that of a random ground truth reference list from the same field. This figure displays the distributions of the pairwise cosine similarity between OpenAI text-embedding-3-large vector embeddings (size=$3,072$) of the titles of the ground truth (blue, n=$274,951$), generated (green, n=$274,497$), existing generated (orange, n=$116,939$), and non-existing generated (red, n=$157,558$) references with the title and abstract of their corresponding focal paper (n=$10,000$). As a benchmark, we also compute for each focal paper the pairwise cosine similarity with the reference from a random ground truth reference list from the same field (gray, n=$274,951$).
  • ...and 12 more figures
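
The Figure 5 caption compares references to focal papers by the cosine similarity of their embedding vectors. The sketch below shows that measure on toy three-dimensional vectors; the function name `cosine_similarity` and the vectors are hypothetical illustrations, and real text-embedding-3-large vectors would have 3,072 dimensions as stated in the caption. Obtaining the embeddings themselves is outside the scope of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for embeddings (hypothetical values, for illustration only).
focal_embedding = [1.0, 0.0, 1.0]
aligned_reference = [2.0, 0.0, 2.0]    # same direction as the focal paper
unrelated_reference = [0.0, 1.0, 0.0]  # orthogonal to the focal paper

aligned_score = cosine_similarity(focal_embedding, aligned_reference)
unrelated_score = cosine_similarity(focal_embedding, unrelated_reference)
```

Because the measure depends only on direction, the aligned toy reference scores close to 1 even though its vector is twice as long, while the orthogonal one scores 0.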