Spiral of Silence: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering

Xiaoyang Chen; Ben He; Hongyu Lin; Xianpei Han; Tianshu Wang; Boxi Cao; Le Sun; Yingfei Sun

TL;DR

The paper examines how LLM-generated content, once it is continuously indexed by web retrieval systems, alters Retrieval-Augmented Generation (RAG) performance in Open-Domain Question Answering (ODQA). It introduces an iterative simulation pipeline that adds LLM-generated texts to the corpus and evaluates retrieval and QA across multiple retrieval backends, languages, and LLMs. The key findings are that LLM-generated content yields immediate retrieval gains but degrades retrieval quality over the long term, while QA performance stays stable and LLM-generated content increasingly dominates the top results. This signals a Spiral of Silence in which human-authored content is progressively marginalized. The work highlights risks to information diversity and reliability in AI-assisted IR and motivates interventions to preserve diversity and accuracy in search ecosystems.

Abstract

The practice of Retrieval-Augmented Generation (RAG), which integrates Large Language Models (LLMs) with retrieval systems, has become increasingly prevalent. However, the repercussions of LLM-derived content infiltrating the web and influencing the retrieval-generation feedback loop are largely uncharted territories. In this study, we construct and iteratively run a simulation pipeline to deeply investigate the short-term and long-term effects of LLM text on RAG systems. Taking the trending Open Domain Question Answering (ODQA) task as a point of entry, our findings reveal a potential digital "Spiral of Silence" effect, with LLM-generated text consistently outperforming human-authored content in search rankings, thereby diminishing the presence and impact of human contributions online. This trend risks creating an imbalanced information ecosystem, where the unchecked proliferation of erroneous LLM-generated content may result in the marginalization of accurate information. We urge the academic community to take heed of this potential issue, ensuring a diverse and authentic digital information landscape.
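The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' pipeline: `Doc`, `overlap_score`, `retrieve`, `stub_generate`, and `simulate` are hypothetical names, the retriever is a simple token-overlap ranker, and the "LLM" is a stub that echoes the query together with the top retrieved passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    source: str  # "human" or "llm" (hypothetical labels for this sketch)


def overlap_score(query: str, doc: Doc) -> int:
    """Toy relevance: number of distinct query tokens found in the document."""
    return len(set(query.lower().split()) & set(doc.text.lower().split()))


def retrieve(query: str, corpus: list[Doc], k: int) -> list[Doc]:
    """Return the k best documents; the stable sort keeps corpus order on ties."""
    return sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)[:k]


def stub_generate(query: str, context: list[Doc]) -> Doc:
    """Stand-in for an LLM call: echoes the query plus the top context passage."""
    top_text = context[0].text if context else ""
    return Doc(text=f"{query} {top_text}", source="llm")


def simulate(queries: list[str], corpus: list[Doc], iterations: int, k: int = 2):
    """Run the retrieve -> generate -> index loop.

    Returns the per-iteration share of LLM-sourced documents among the
    top-k results, and the final corpus.
    """
    corpus = list(corpus)  # copy so the caller's list is not mutated
    llm_share = []
    for _ in range(iterations):
        new_docs = []
        llm_hits = 0
        for query in queries:
            top = retrieve(query, corpus, k)
            llm_hits += sum(d.source == "llm" for d in top)
            new_docs.append(stub_generate(query, top))
        llm_share.append(llm_hits / (k * len(queries)))
        # Generated texts are indexed only after the whole iteration finishes.
        corpus.extend(new_docs)
    return llm_share, corpus
```

Calling `simulate` with a handful of human-written passages and a few queries shows the share of LLM-sourced documents in the top-k results growing across iterations. In this toy setting that happens only because the stub copies query terms into its output, so it illustrates the loop structure rather than reproducing the paper's measurements.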

Paper Structure (23 sections, 2 equations, 19 figures, 10 tables, 1 algorithm)

Introduction
Related Works
Pipeline Construction
Preliminaries
Simulation Process
Experiment
Results
Short-Term Effects on RAG Performance
Long-term Effects on RAG Performance
Spiral of Silence
Effects of "Spiral of Silence" on ODQA
Analysis
Conclusion
Appendix
Discussion of Application on "Spiral of Silence"
...and 8 more sections

Figures (19)

Figure 1: The evolution of RAG systems after introducing LLM-generated texts, where the "Spiral of Silence" effect gradually emerges.
Figure 2: Short-Term QA performance. For each retrieval method, we present both the average performance and the range of variation exhibited by five LLMs. A red dashed line symbolizes the average EM score for zero-shot question generation by LLMs. "Ori." and "+LLM$_Z$" represent the average EM values when models use the original dataset or a dataset enhanced with LLM-generated texts as context, respectively. Retrieval methods are abbreviated: "Contri" for Contriever, "LLM-E" for LLM-Embedder, and "BGE-B" for BGE$_{base}$.
Figure 3: Long-Term RAG performance. The upper section illustrates the retrieval outcomes for various methods, while the lower section depicts the average EM across LLMs. Iteration 1 represents the results following the incorporation of zero-shot LLM-generated text. Abbreviated re-ranking methods in the legend are: +U for UPR, +M for MonoT5, and +BR for BGE-Reranker.
Figure 4: Average percentage of texts from various sources within the top 50 search results over multiple iterations across different search methods. For results on WebQ and TriviaQA, please refer to the corresponding figure in the Appendix.
Figure 5: 3-gram Self-BLEU score for the top 5 search results over iterations, from the original dataset (Ori.) to subsequent iterations including LLM-generated texts. For results on WebQ and TriviaQA, please refer to the corresponding figure in the Appendix.
...and 14 more figures
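Two of the quantities named in the captions above, the exact-match (EM) score and the share of each text source among the top-ranked results, can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and the answer normalization (lowercasing, then stripping punctuation and English articles) is a common ODQA convention assumed here, since the captions do not specify the exact procedure used in the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def source_share(ranked_sources: list[str], k: int) -> dict[str, float]:
    """Fraction of the top-k ranked results contributed by each source label."""
    top = ranked_sources[:k]
    if not top:
        return {}
    return {source: top.count(source) / len(top) for source in set(top)}
```

Averaging `exact_match` over a question set gives an EM score. Applying `source_share` to each query's ranking and averaging per source gives a source-percentage curve of the kind Figure 4 describes.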
