Neural Retrievers are Biased Towards LLM-Generated Content

Sunhao Dai; Yuqi Zhou; Liang Pang; Weihao Liu; Xiaolin Hu; Yong Liu; Xiao Zhang; Gang Wang; Jun Xu

TL;DR

This work investigates how the flood of LLM-generated content influences information retrieval, revealing a source bias where neural retrievers (and re-rankers) preferentially rank LLM-generated documents. To study this in a realistic setting, the authors construct SciFact+AIGC and NQ320K+AIGC by rewriting human-written seeds with LLMs while preserving semantics, and they validate these datasets with term/semantic analyses and human judgments. They analyze the bias through text compression and perplexity viewpoints, showing LLM-generated text is semantically more concentrated and easier for PLMs to model. Finally, they propose a plug-and-play debiased constraint to mitigate the bias and discuss broader risks and future directions for IR in the LLM era, supported by two new benchmarks for ongoing research.
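
The term-level validation mentioned above can be pictured with a small sketch. It compares a human-written text with an LLM rewrite using Jaccard similarity and term overlap. The tokenizer, the function names, the exact definition of overlap, and the two sample sentences are all assumptions made for illustration; the summary does not specify how the authors compute these statistics.

```python
import re


def terms(text: str) -> set:
    """Lowercase a text and return its set of alphanumeric terms (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(human_doc: str, llm_doc: str) -> float:
    """Size of the term intersection divided by the size of the term union."""
    a, b = terms(human_doc), terms(llm_doc)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(human_doc: str, llm_doc: str) -> float:
    """Share of the human document's terms that also occur in the rewrite (assumed definition)."""
    a, b = terms(human_doc), terms(llm_doc)
    return len(a & b) / len(a) if a else 0.0


# Hypothetical pair: a human-written sentence and a paraphrase of it.
human = "The cat sat on the mat"
rewrite = "A cat was sitting on the mat"
print(jaccard(human, rewrite), overlap(human, rewrite))
```

Low values on both measures for a pair that human judges still consider equivalent would suggest the rewrite changed the wording while keeping the meaning, which is the property the constructed datasets are meant to have.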

Abstract

Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search, by generating vast amounts of human-like text on the Internet. As a result, IR systems in the LLM era face a new challenge: the indexed documents are now not only written by human beings but also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrievers towards LLM-generated content as the source bias. Moreover, we discover that this bias is not confined to first-stage neural retrievers but extends to second-stage neural re-rankers. In-depth analyses from the perspective of text compression then indicate that LLM-generated texts exhibit more focused semantics with less noise, making semantic matching easier for neural retrieval models. To mitigate the source bias, we also propose a plug-and-play debiased constraint for the optimization objective, and experimental results show its effectiveness. Finally, we discuss the potentially severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the two newly constructed benchmarks are available at https://github.com/KID-22/Source-Bias.
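
One way to make "rank LLM-generated documents higher" measurable is to score each query twice with the same ranking metric, once counting only the human-written relevant documents and once counting only their LLM-generated counterparts, and then compare the two scores. The sketch below uses a mean-normalized percentage gap. This formula, the function name, and the sample scores are assumptions for illustration; the abstract does not state the exact metric the authors use.

```python
def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Percentage gap between two metric scores, normalized by their mean.

    Negative values mean the LLM-generated documents score higher,
    which is the direction of source bias described in the abstract.
    """
    mean = (metric_human + metric_llm) / 2
    if mean == 0:
        return 0.0  # both scores are zero, so there is no gap to report
    return (metric_human - metric_llm) / mean * 100


# Made-up scores (for example, NDCG in percent); not results from the paper.
print(relative_delta(30.0, 50.0))
```

Normalizing by the mean keeps the measure symmetric: swapping the two sources flips the sign without changing the magnitude.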

Paper Structure (27 sections, 1 theorem, 9 equations, 10 figures, 7 tables)

Introduction
RQ1: Environment Construction
Notation
Constructing IR Datasets in the LLM Era
Human-Written Corpus
LLM-Generated Corpus
Statistics and Quality Validation of Datasets
Term-based Statistics and Analysis
Semantic-based Statistics and Analysis
Retrieval Performance Evaluation
Human Evaluation
RQ2: Uncovering Source Bias
Evaluation Metrics for Source Bias
Bias in Neural Retrieval Models
Bias in Re-Ranking Stage
...and 12 more sections

Key Result

Theorem 1

Given the following conditions: If LLM aligns more closely with BERT than with humans when predicting $d^G$ given $d^H$, such that for any $s \in [S],$ it follows that

Figures (10)

Figure 1: The overview evolution of IR paradigm from the Pre-LLM era to the LLM era.
Figure 2: The overall paradigm of the proposed evaluation framework for IR in the LLM era.
Figure 3: Distribution of term Jaccard similarity and overlap between Llama2-generated and human-written corpora.
Figure 4: Semantic embedding visualization of different corpora on SciFact+AIGC and NQ320K+AIGC datasets.
Figure 5: Distribution of cosine similarity of semantic embedding between Llama2-generated and human-written corpora.
...and 5 more figures

Theorems & Definitions (1)

Theorem 1
