Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

Sunhao Dai, Weihao Liu, Yuqi Zhou, Liang Pang, Rongju Ruan, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen

TL;DR

Cocktail addresses the gap in IR benchmarking for mixed human- and LLM-generated content by constructing 16 datasets plus the NQ-UTD test set and evaluating a wide range of retrieval models. Using a BEIR-compatible evaluation framework and an explicit Relative $\Delta$ measure, the study reveals a robust trade-off: neural models achieve higher ranking performance but exhibit increasing bias toward LLM-generated content, a phenomenon that propagates through re-ranking. The work demonstrates semantic preservation between sources via LLM rewrites and provides an open-source toolkit for standardized, reproducible evaluation in the LLM era. These findings guide the development of IR systems that balance strong retrieval with bias mitigation, and the dataset/resource release enables broader community adoption and benchmarking.
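The Relative $\Delta$ measure is named above but not defined in this summary. As a minimal sketch, assuming it is the difference between a ranking metric computed against human-written documents and the same metric computed against LLM-generated documents, normalized by their mean and expressed as a percentage, it could be computed as follows. The function name, argument names, and example scores are hypothetical and are not taken from the Cocktail toolkit.

```python
def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Relative percentage difference between two ranking scores.

    Assumed definition (not stated in this summary):
        (human - llm) / ((human + llm) / 2) * 100
    A negative value means the LLM-generated documents score higher,
    i.e. the model is biased towards LLM-generated content.
    """
    mean = (metric_human + metric_llm) / 2.0
    if mean == 0:
        # Both scores are zero: no measurable preference either way.
        return 0.0
    return (metric_human - metric_llm) / mean * 100.0


# Hypothetical scores for illustration only, not results from the paper.
human_score = 30.0
llm_score = 50.0
print(relative_delta(human_score, llm_score))  # -50.0
```

Normalizing by the mean of the two scores keeps the measure symmetric: swapping the two inputs flips the sign without changing the magnitude.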

Abstract

The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail.

Paper Structure (24 sections, 2 equations, 7 figures, 16 tables)

Figures (7)

  • Figure 1: Ranking performance versus source bias comparison on averaged results of 16 datasets benchmarked in Cocktail. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content. The Pearson correlation coefficient between these two axes is $-0.772$ ($p$-value < 0.05), indicating a strong negative correlation. For brevity, we omit the '%' symbol of the scores in all the tables and figures.
  • Figure 2: An overview of the dataset construction pipeline involved in Cocktail.
  • Figure 3: Results of different pooling strategies. "w-mean" denotes weighted mean pooling. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content.
  • Figure 4: Comparison of different model sizes. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content.
  • Figure 5: Distribution of text length of corpus for each dataset in Cocktail.
  • ...and 2 more figures
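
The Figure 1 caption reports a Pearson correlation between ranking performance and Relative $\Delta$ across models. A minimal sketch of that computation, using only the standard library, is given below. The per-model values are hypothetical placeholders, not numbers from the paper, and the function name is an illustrative choice.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model averages (one entry per retrieval model).
ranking_performance = [30.0, 40.0, 50.0, 60.0]
relative_delta_values = [5.0, -5.0, -20.0, -30.0]
r = pearson(ranking_performance, relative_delta_values)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to $-1$ on such data would correspond to the pattern the caption describes: models with higher ranking performance tend to have a more negative Relative $\Delta$.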