Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias

Andres Algaba, Carmen Mazijn, Vincent Holst, Floriano Tori, Sylvia Wenmackers, Vincent Ginis

TL;DR

This work investigates whether Large Language Models (LLMs) generate scholarly references in a way that mirrors human citation patterns while assessing potential biases arising from their parametric knowledge. By prompting GPT-4, GPT-4o, and Claude 3.5 to suggest references for anonymized in-text citations across 166 cs.LG papers published in AAAI, NeurIPS, ICML, and ICLR after GPT-4's knowledge cut-off, and validating against Semantic Scholar, the study analyzes bibliometric properties and citation networks of both existing and non-existent generated references. It finds that LLMs largely reflect human citation patterns but exhibit a heightened bias toward highly cited works, a tendency that persists after controlling for factors like year, title length, venue, and authors, and that extends to multiple models, including those with training data exposure to the target papers. The results suggest that while LLMs can assist in citation generation, they may amplify existing biases such as the Matthew effect, underscoring the need for bias-aware prompting and verification in scholarly workflows.

Abstract

Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of Large Language Models (LLMs) introduces a new dynamic to these practices. Interestingly, the characteristics and potential biases of references recommended by LLMs that rely entirely on their parametric knowledge, and not on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset of papers from AAAI, NeurIPS, ICML, and ICLR, published after GPT-4's knowledge cut-off date. In our experiment, LLMs are tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced high citation bias, which persists even after controlling for publication year, title length, number of authors, and venue. The results hold for GPT-4 as well as for the more capable models GPT-4o and Claude 3.5, for which the papers are part of the training data. Additionally, we observe a large consistency between the characteristics of the LLMs' existing and non-existent generated references, indicating the models' internalization of citation patterns. By analyzing citation graphs, we show that the recommended references are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases, such as the Matthew effect, and introduce new ones, potentially skewing scientific knowledge dissemination.

Paper Structure (16 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Overview of our experiment evaluating the characteristics and biases of LLM-generated references when the models are tasked with suggesting references for anonymized in-text citations. We collect $166$ papers from the cs.LG category on arXiv that were published in the main tracks of AAAI, NeurIPS, ICML, and ICLR and only became available online after GPT-4's knowledge cut-off date. We split the main content, which includes the author information, conference information, abstract, and introduction, from the ground truth references. GPT-4, GPT-4o, and Claude 3.5 are prompted to generate suggestions of scholarly references for the anonymized in-text citations in the main content. We verify the existence of the generated references via Semantic Scholar and compare the characteristics, such as title length, publication year, venue, and number of authors, of the existing and non-existent generated references with the ground truth. For the existing generated references, we also compare additional characteristics, such as the number of citations and references, and analyze the properties of their citation networks.
  • Figure 2: Properties of the ground truth and GPT-4 generated introduction references for the vanilla strategy. This figure displays the properties of the ground truth ($n=14,554$, in blue) and GPT-4 generated references ($n=14,554$, in green), further subdividing the generated references into existing ($n=9,376$, in orange) and non-existent ($n=5,178$, in red), from the original data sources of five runs for the vanilla strategy with GPT-4. a, The average percentage of existing generated references in total ($64.4\%$) and for each publication venue under the vanilla and iterative strategy, with dots representing the percentage for each of the five runs with GPT-4. b, The distribution of the number of characters in the title shows some differences between the ground truth (median $62$) and generated (median $58$) references, with the non-existent references being slightly longer and having a larger variance than the existing generations. c, The distribution over time reveals that ground truth references are relatively more recent than generated references, with most references post-$2010$. The temporal distribution of the non-existent generated references aligns more with the ground truth than that of the existing generated references. d, The distribution of the number of authors demonstrates a disparity between the ground truth and generated references, which have median values of three and two, respectively. However, GPT-4 more often generates "et al.", which does not allow for an exact computation, especially for the non-existent references. e, The distribution of publication venues shows that for most venues the ground truth has the highest relative representation, followed closely by existing references. The non-existent references deviate more from the ground truth, as the proportion of "Others" is substantially larger. f, The distributions of citations for ground truth and existing generated references reveal a substantial citation bias in the generated references, with a difference in median citations of $1,326$. g, Finally, the distribution of references shows that ground truth references cite slightly more papers than the existing generated references, with a difference in median references of $6$.
  • Figure 3: The citation bias in existing GPT-4 generated references is not due to the recency of ground truth references. This figure shows that the existing GPT-4 generated references ($n=9,376$, in orange) consistently exhibit a higher citation count compared to their corresponding ground truth ($n=9,376$, in blue) across subperiods. a, The citation counts across time for the ground truth and existing generated references reveal that the most recent references have a relatively low number of citations. The difference in median citations between the existing generated references and their corresponding ground truth references is 1,257. Since the ground truth references are relatively more recent compared to the existing generated references, we examine whether the observed citation bias is related to the recency of ground truth references. b, The distributions of citations by subperiod reveal that the existing generated references consistently exhibit a higher citation count than their corresponding ground truth counterparts. c, The difference in median citations is most pronounced in the early and late subperiods, i.e., $\leq 1988$, 2010-2016, and 2017-2023.
  • Figure 4: The GPT-4 generated references display similar citation network properties as the ground truth references but with a heightened citation bias. This figure displays how the existing GPT-4 generated references ($n=2,945$, first run of vanilla strategy) are embedded in the citation network of the focal papers ($m=166$ in total). a, We depict the connections between the focal paper, the ground truth references, and the existing generated references by showing the underlying citation graphs. An arrow from A to B indicates that A cites B. We identify the focal paper (in blue), generated references that appear in the introduction (in green) or in the paper (in yellow), generated references that are linked to ground truth or other generated references (in orange), generated references that are completely isolated (in purple), and ground truth references that are not cited by GPT-4 (in gray). b, The majority of generated references do not appear in the introduction or the paper itself but are connected in some way to the ground truth references, as only a small fraction of generated references are completely isolated. c, The heightened citation bias is most pronounced for generated references that appear in the introduction or the paper, with isolated generated references having the lowest number of citations. d, The normalized average clustering coefficients of the ground truth (green and gray nodes) and the existing generated references (green, yellow, orange, and purple nodes) indicate that GPT-4's internalization of citation patterns extends to citation network properties. The clustering coefficient for a node A is given by $\frac{\# \text{triangles through A}}{\# \text{possible triangles through A}}$. The average is computed across the coefficients of all nodes in the respective graph (excluding nodes with coefficient zero) and indicates the tendency of the respective references to appear in clusters. e, The non-isolated generated references are tightly connected to the ground truth references, both on an individual level (Boolean edge density) and on an aggregate level (edge expansion). The Boolean edge density is the fraction of non-isolated generations (orange nodes) that are connected to at least one ground truth reference (green and gray nodes) per focal paper. The edge expansion between those two sets is defined as the number of edges between the two sets divided by the smallest set size. f, The number of references is similar across all categories, except for the isolated generated references, which have substantially fewer references.
  • Figure B1: Properties of the ground truth and GPT-4 generated introduction references for the iterative strategy are consistent with the properties of the vanilla strategy. This figure displays the properties of the ground truth ($n=5,178$, in blue) and GPT-4 generated references ($n=5,178$, in green), further subdividing the generated references into existing ($n=3,244$, in orange) and non-existent categories ($n=1,934$, in red), from the original data sources of five runs for the iterative strategy with GPT-4. Note that these are the references which are labelled "non-existent" in the vanilla strategy. a, b, c, d, e, f and g, The iterative results exhibit very similar properties to the vanilla results shown in Figure 2.
  • ...and 8 more figures
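
The three network measures defined in the Figure 4 caption (clustering coefficient, Boolean edge density, and edge expansion) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that citation edges are treated as undirected, that the two node sets passed to the set-based measures are disjoint, and that no normalization is applied (the caption does not specify how the averages are normalized). All function names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (citing, cited) pairs.

    Assumption: edge direction is dropped for these measures.
    """
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj, node):
    """# triangles through node / # possible triangles through node."""
    neighbours = adj.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0  # no triangle is possible
    triangles = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return triangles / (k * (k - 1) / 2)


def average_nonzero_clustering(adj, nodes):
    """Average coefficient over nodes, excluding nodes with coefficient zero."""
    nonzero = [c for c in (clustering_coefficient(adj, n) for n in nodes) if c > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


def boolean_edge_density(adj, generated, ground_truth):
    """Fraction of generated nodes linked to at least one ground truth node."""
    if not generated:
        return 0.0
    linked = sum(1 for g in generated if adj.get(g, set()) & ground_truth)
    return linked / len(generated)


def edge_expansion(adj, set_a, set_b):
    """Edges between two disjoint sets divided by the smaller set size."""
    if not set_a or not set_b:
        return 0.0
    crossing = sum(len(adj.get(a, set()) & set_b) for a in set_a)
    return crossing / min(len(set_a), len(set_b))


# Hypothetical toy graph: one focal paper, two ground truth references,
# and two generated references (gen2 is linked only to gen1).
toy_edges = [
    ("focal", "gt1"), ("focal", "gt2"), ("gt1", "gt2"),
    ("gen1", "gt1"), ("gen1", "gt2"), ("gen2", "gen1"),
]
adj = build_adjacency(toy_edges)
ground_truth = {"gt1", "gt2"}
generated = {"gen1", "gen2"}

print(clustering_coefficient(adj, "gen1"))
print(boolean_edge_density(adj, generated, ground_truth))
print(edge_expansion(adj, generated, ground_truth))
```

In this toy graph, `gen1` has three neighbours of which only one pair (`gt1`, `gt2`) is connected, so its coefficient is one third; half of the generated nodes touch a ground truth reference; and two crossing edges over a smaller set size of two give an edge expansion of one.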