Leveraging LLM-based agents for social science research: insights from citation network simulations

Jiarui Ji, Runlin Lei, Xuchen Pan, Zhewei Wei, Hao Sun, Yankai Lin, Xu Chen, Yongzheng Yang, Yaliang Li, Bolin Ding, Ji-Rong Wen

TL;DR

The paper introduces CiteAgent, a framework that uses LLM-based agents to simulate social-behavioral processes in citation networks, reproducing key structural phenomena such as power-law in-degree distributions, citational distortion, and shrinking diameter. It establishes two LLM-based paradigms, LLM-SE and LLM-LE, to perform hypothesis-driven analyses of citation decisions and network evolution, validated against real networks and extended through idealized social experiments. The work demonstrates how LLM-driven simulations can test, refine, and challenge theories in science-of-science research, while offering new metrics like Referencing Preference Score to disentangle structural effects from intentional biases. Overall, CiteAgent provides a scalable, reproducible platform for counterfactual and empirical investigation of citation dynamics and offers insights with potential implications for real-world academic environments.

Abstract

The emergence of Large Language Models (LLMs) demonstrates their potential to encapsulate the logic and patterns inherent in human behavior simulation by leveraging extensive web data pre-training. However, the boundaries of LLM capabilities in social simulation remain unclear. To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this realistic simulation, we establish two LLM-based research paradigms in social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms facilitate rigorous analyses of citation network phenomena, allowing us to validate and challenge existing theories. Additionally, we extend the research scope of traditional science of science studies through idealized social experiments, with the simulation experiment results providing valuable insights for real-world academic environments. Our work demonstrates the potential of LLMs for advancing science of science research in social science.

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, 2 algorithms.

Figures (6)

  • Figure 1: An Illustration of One Simulation Step in the CiteAgent Framework. a, Initialization: LLM-based agents, denoted as $A$, are built as distinct authors, each endowed with its own attributes; b, Socialization: We designate an active author subset, denoted as $A_a$. Each active agent $a \in A_a$ engages in a group discussion with collaborators and collaboratively develops paper drafts. c, Creation: Each active agent uses a scholarly search engine to retrieve relevant papers and finalizes the paper drafts with reference selection.
  • Figure 2: Power-Law Distribution in Citation Networks Generated by CiteAgent. CiteAgent expands the large datasets Cora and CiteSeer to 5000 nodes, and the LLM-Agent dataset to 1000 nodes. We fit the network in-degree ($k$) distribution with the power-law distribution model. $k$ is plotted in log-binned and linearly-binned formats against the probability density function $P(k)$ on a log-log scale. $\alpha$ indicates the power-law exponent, $D^*$ denotes significance at p-value $< 0.01$, $\bar{k}$ indicates the average $k$.
  • Figure 3: The LLM-LE and LLM-SE Analysis for Different LLMs in Forming Power-Law Distributions. a, LLM-LE: $D$ for generated citation networks in all experimental conditions. b, LLM-LE: maximum in-degree for citation networks in all experimental conditions. c, LLM-SE: the predominant paper influencing author reference selection behavior.
  • Figure 4: The LLM-SE and LLM-LE Analysis for the Citational Distortion Phenomenon. a, LLM-SE: the reference selection proportion driven by country-related information, grouped by papers from different countries. b, The citational distortion result figure from gomez2022leading. c, LLM-LE: the $\beta$ evolution trend in different experimental conditions, which shows that preferential attachment causes $\beta$ coefficient exceeding.
  • Figure 5: Examination and Analysis of Citational Distortion. a, A comparison of self-citation rate (SCR) by country using the Scopus dataset, alongside the citation networks generated in the base and equal author experimental conditions. b, A comparison of the Gini coefficient in public and anonymous experimental conditions. c, Comparison of RPS between core countries and peripheral countries in the public experimental condition. d, RPS evolution process in the public experimental condition.
  • ...and 1 more figure
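
The captions for Figures 2 and 3 refer to fitting an in-degree distribution with a power-law model and reporting an exponent $\alpha$ and a statistic $D$. The sketch below illustrates one common way such a fit can be done; it is not taken from the paper. It assumes the continuous-approximation maximum-likelihood estimator $\alpha = 1 + n / \sum_i \ln(k_i / k_{\min})$ and assumes that $D$ is a Kolmogorov-Smirnov distance between the empirical and fitted distributions. The function names `fit_power_law` and `sample_power_law` and the parameter `k_min` are hypothetical, and real in-degrees are integers, so a discrete estimator may be more appropriate in practice.

```python
import math
import random


def fit_power_law(degrees, k_min=1.0):
    """Estimate a power-law exponent and a KS distance for the tail k >= k_min.

    Assumption: continuous-approximation MLE, not necessarily the paper's method.
    Returns (alpha, d), where d is the largest gap between the empirical CDF
    and the fitted CDF 1 - (k / k_min) ** (1 - alpha).
    """
    tail = sorted(k for k in degrees if k >= k_min)
    n = len(tail)
    if n == 0:
        raise ValueError("no observations at or above k_min")

    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0.0:
        raise ValueError("all observations equal k_min; exponent is undefined")

    alpha = 1.0 + n / log_sum

    # Kolmogorov-Smirnov distance: check the gap just before and just after
    # each step of the empirical CDF.
    d = 0.0
    for i, k in enumerate(tail):
        model_cdf = 1.0 - (k / k_min) ** (1.0 - alpha)
        d = max(d, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return alpha, d


def sample_power_law(n, alpha, k_min=1.0, seed=0):
    """Draw n synthetic values from a continuous power law by inverse transform."""
    rng = random.Random(seed)
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    synthetic = sample_power_law(5000, alpha=2.5)
    alpha_hat, d_stat = fit_power_law(synthetic)
    print(f"alpha = {alpha_hat:.3f}, D = {d_stat:.4f}")
```

With synthetic data drawn from a known exponent, the estimate should land close to the true value and the distance should be small; nodes with zero in-degree fall below `k_min` and are excluded from the fit.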