Epistemic Diversity and Knowledge Collapse in Large Language Models

Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Peter Ebert Christensen, Chan Young Park, Isabelle Augenstein

TL;DR

This paper defines epistemic diversity as variation in real-world claims produced by large language models (LLMs) and develops a claim-centric methodology to measure it. By clustering decomposed, open-ended model outputs into meaning classes and computing Hill-Shannon diversity, the authors conduct a broad empirical study across 27 LLMs, 155 topics, 12 countries, and 200 prompts per model. They find that newer models provide more diverse outputs but remain less diverse than a basic web search, with model size negatively impacting diversity and retrieval-augmented generation (RAG) improving it, though effects vary by country. The work also reveals English-language dominance in knowledge representation relative to local languages and discusses practical implications for mitigating knowledge collapse through RAG design and the use of smaller models, offering a general methodology for future cross-cultural epistemic analyses of LLMs.

Abstract

Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation.
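
To make the diversity measure named in the TL;DR concrete, the following is a minimal sketch of Hill-Shannon diversity (the Hill number of order 1), i.e., the exponential of the Shannon entropy of the cluster proportions. It assumes claims have already been decomposed and assigned to meaning classes; the function name, the example labels, and the lack of any sample-size correction are assumptions for illustration and not necessarily the authors' exact procedure.

```python
import math
from collections import Counter


def hill_shannon_diversity(cluster_counts):
    """Return exp(H), where H is the Shannon entropy (natural log) of the
    cluster proportions p_i = x_i / sum(x).

    The value can be read as the "effective number" of equally common
    meaning classes: 1.0 when every claim falls in one cluster, and K
    when claims are spread evenly over K clusters.
    """
    total = sum(cluster_counts)
    if total == 0:
        return 0.0  # no claims observed, so no diversity to report
    entropy = 0.0
    for x in cluster_counts:
        if x > 0:  # empty clusters contribute nothing to the entropy
            p = x / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


# Hypothetical cluster labels for the claims generated on one topic.
example_labels = ["c1", "c1", "c1", "c2", "c2", "c3"]
example_counts = list(Counter(example_labels).values())
example_diversity = hill_shannon_diversity(example_counts)
```

Under this sketch, a model whose claims concentrate in a few dominant clusters scores close to 1, while a source that spreads its claims over many clusters scores higher, which is the sense in which a search baseline can be "more diverse" than an LLM.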

Paper Structure

This paper contains 36 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: In this work, we measure epistemic diversity -- via variability in claims about the world -- for characterizing knowledge collapse in LLMs.
  • Figure 2: Histograms of the top ten clusters for four topics after generating text, decomposing, and clustering decomposed claims across all models in our study. The frequency of claims in each cluster, $x_{i}$, is represented by the colored bars. By the 10th cluster, $x_{i}$ is halved for all four topics, indicating a large decay rate for $x_{i}$. The top clusters for each topic convey broad and general information for each topic.
  • Figure 3: Epistemic diversity vs. model release date. Each point is a single model, with lines connecting models of approximately the same size across released versions. Error bars are 95% bootstrapped confidence intervals based on the HSD of each topic (N=155). Absolute diversity is low for all models compared to a very modest search baseline (top 20 Google search results for each topic). However, for all families except Qwen and most sizes, we see a trend of improved diversity.
  • Figure 4: Heatmap of the Jensen-Shannon divergence (JSD) across models, based on the empirical probability distributions over clusters ($p_{i}$) for each topic. A higher JSD means that the distributions generated by the two models are more different. Open-weight models tend to be more similar to each other than to GPT. All LLMs are more different from the search baseline than to each other, indicating a marked difference in the distribution of information in the search baseline from the LLMs.
  • Figure 5: Average diversity per country across all models with bootstrapped 95% confidence intervals. Bars are sorted according to the difference between RAG and IFT diversity. Countries tend to have similar diversity to each other with instruction fine-tuning only. However, RAG appears to have an uneven impact on different countries, where the US and general topics see the most benefit.
  • ...and 3 more figures
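
The Jensen-Shannon divergence used in the Figure 4 caption can be sketched as follows for two models' empirical distributions over the same set of clusters. This is an illustrative implementation only: the function name, the base-2 logarithm (which bounds the value in [0, 1]), and the requirement that both inputs are already normalized over an aligned cluster list are assumptions, since the caption does not state these details.

```python
import math


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two probability
    distributions given as equal-length sequences over the same clusters.

    JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same clusters")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0; b_i > 0 whenever a_i > 0
        # because b is the mixture of the two inputs.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical cluster distributions for two models on one topic.
model_a = [0.5, 0.3, 0.2]
model_b = [0.2, 0.3, 0.5]
divergence = jensen_shannon_divergence(model_a, model_b)
```

With base-2 logarithms, identical distributions give 0 and distributions with no shared clusters give 1, so a higher cell in such a heatmap corresponds to two sources that emphasize different claims.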