AI and the Problem of Knowledge Collapse

Andrew J. Peterson

TL;DR

This work investigates how widespread AI-enabled access to information could induce knowledge collapse, a narrowing of the knowledge distribution away from long-tail ideas. It develops a tail-preserving dynamic where agents choose between traditional learning and discounted AI content, formalizing knowledge collapse as the divergence between the public distribution $p_{\text{public}}(x)$ and the true distribution $p_{\text{true}}(x)$, with the distance measured by $H(p_{\text{public}}(x), p_{\text{true}}(x))$. The model uses $n$ agents with types $\boldsymbol{\theta}_n$ drawn from a lognormal, AI-discounting parameter $\boldsymbol{\delta}$, truncation $\boldsymbol{\sigma_{\text{tr}}}$, and learning rate $\boldsymbol{\eta}$, updating the public distribution via kernel density estimation from $n_{\text{samp}} = 100$ samples; tails are emphasized to study long-run effects and potential generational dynamics. Empirically, the paper proposes metrics like Shannon's Diversity Index and Pielou's Evenness Index to quantify output diversity, and provides a case study showing that prompts designed to elicit diverse outputs (e.g., region-based prompts) yield substantially higher diversity across multiple LLMs, underscoring the need for safeguards against recursive AI dependencies and for maintaining tail knowledge in education and policy.
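To make the distance between public and true distributions concrete, the sketch below measures how far a center-truncated distribution sits from the full one. It rests on several assumptions that are not stated in the summary above: that $H$ denotes the Hellinger distance, that a standard normal can stand in for the true distribution, and that AI-generated content behaves like that distribution cut off at plus or minus a truncation limit (called `sigma_tr` here). All function names are hypothetical, and the agent dynamics and kernel density update of the full model are deliberately left out.

```python
import math


def normal_pdf(x: float) -> float:
    """Standard normal density, standing in for p_true (an assumption)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def truncated_normal_pdf(x: float, sigma_tr: float) -> float:
    """Standard normal cut off at +/- sigma_tr and renormalized to integrate to 1."""
    if abs(x) > sigma_tr:
        return 0.0
    mass = math.erf(sigma_tr / math.sqrt(2.0))  # probability inside the cut
    return normal_pdf(x) / mass


def hellinger(p, q, lo: float = -8.0, hi: float = 8.0, steps: int = 16000) -> float:
    """Hellinger distance sqrt(1 - integral of sqrt(p*q)), via the midpoint rule."""
    dx = (hi - lo) / steps
    overlap = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        overlap += math.sqrt(p(x) * q(x)) * dx
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - overlap))


def distance_to_truth(sigma_tr: float) -> float:
    """Distance between the full distribution and its truncated version."""
    return hellinger(normal_pdf, lambda x: truncated_normal_pdf(x, sigma_tr))


if __name__ == "__main__":
    # Tighter truncation discards more of the tails.
    for sigma_tr in (2.0, 1.0, 0.5):
        print(sigma_tr, round(distance_to_truth(sigma_tr), 3))
```

In this toy setting the distance grows as the truncation limit shrinks, which is the qualitative pattern the summary describes: the more the tails are cut away, the further the resulting distribution drifts from the true one.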

Abstract

While artificial intelligence has the potential to process vast amounts of data, generate new insights, and unlock greater productivity, its widespread adoption may entail unforeseen consequences. We identify conditions under which AI, by reducing the cost of access to certain modes of knowledge, can paradoxically harm public understanding. While large language models are trained on vast amounts of diverse data, they naturally generate output towards the 'center' of the distribution. This is generally useful, but widespread reliance on recursive AI systems could lead to a process we define as "knowledge collapse", which we argue could harm innovation and the richness of human understanding and culture. However, unlike AI models that cannot choose what data they are trained on, humans may strategically seek out diverse forms of knowledge if they perceive them to be worthwhile. To investigate this, we provide a simple model in which a community of learners or innovators chooses between traditional methods and a discounted AI-assisted process, and we identify conditions under which knowledge collapse occurs. In our default model, a 20% discount on AI-generated content generates public beliefs 2.3 times further from the truth than when there is no discount. An empirical approach to measuring the distribution of LLM outputs is provided in theoretical terms and illustrated through a specific example comparing the diversity of outputs across different models and prompting styles. Finally, based on the results, we consider further research directions to counteract such outcomes.
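As a rough illustration of how output diversity might be quantified, the sketch below computes Shannon's Diversity Index, $H' = -\sum_i p_i \ln p_i$, and Pielou's Evenness Index, $J = H' / \ln S$, over a list of categorical answers. The function names and the sample answers are hypothetical, invented only for illustration; normalizing by the number of observed categories $S$ is one common convention and may differ from the normalization used in the paper.

```python
import math
from collections import Counter


def shannon_diversity(labels) -> float:
    """Shannon's Diversity Index H' = -sum(p_i * ln p_i) over observed categories."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pielou_evenness(labels) -> float:
    """Pielou's Evenness J = H' / ln(S), with S the number of observed categories."""
    n_categories = len(set(labels))
    if n_categories < 2:
        return 0.0  # evenness is undefined for fewer than two categories; 0 by convention here
    return shannon_diversity(labels) / math.log(n_categories)


if __name__ == "__main__":
    # Made-up answers to one prompt under two prompting styles (not real data).
    generic_prompt = ["A", "A", "A", "A", "A", "A", "B", "C"]
    diverse_prompt = ["A", "B", "C", "D", "A", "B", "C", "D"]
    for name, answers in (("generic", generic_prompt), ("diverse", diverse_prompt)):
        print(name, shannon_diversity(answers), pielou_evenness(answers))
```

Both indices rise as answers spread more evenly across categories, so comparing them across prompting styles or models gives a simple, if coarse, view of how concentrated the outputs are.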

Paper Structure (19 sections, 10 equations, 10 figures, 3 tables)

Figures (10)

  • Figure A1: Model collapse example from Shumailov et al. (2023).
  • Figure A2: A hypothetical innovation calculation where the new (n+1)th sample moves the public pdf $0.1$ towards the true distribution.
  • Figure A3: Knowledge collapse: The cheaper it is to rely on AI-generated content, the more extreme the degeneration of public knowledge towards the center.
  • Figure A4: Discount rate and learning rate
  • Figure A5: Discount rate and truncation limits
  • ...and 5 more figures

Theorems & Definitions (5)

  • Definition A1: Strong Uniform Representativeness
  • Definition A2: Minimal Representativeness
  • Definition A3: Task-Pragmatic Representativeness
  • Definition A4: Group Proportional Representativeness
  • Definition A5: Knowledge Collapse