Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

Sunhao Dai; Weihao Liu; Yuqi Zhou; Liang Pang; Rongju Ruan; Gang Wang; Zhenhua Dong; Jun Xu; Ji-Rong Wen

TL;DR

Cocktail addresses the gap in IR benchmarking for mixed human- and LLM-generated content by constructing 16 datasets plus the NQ-UTD test set and evaluating a wide range of retrieval models. Using a BEIR-compatible evaluation framework and an explicit Relative $\Delta$ measure, the study reveals a clear trade-off: neural models achieve higher ranking performance but exhibit stronger bias toward LLM-generated content, and this bias propagates through re-ranking. The work shows that the LLM rewrites preserve the semantics of their human-written sources, and it provides an open-source toolkit for standardized, reproducible evaluation in the LLM era. These findings guide the development of IR systems that balance strong retrieval with bias mitigation, and the released datasets and code enable broader community adoption and benchmarking.

Abstract

The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail.

Paper Structure (24 sections, 2 equations, 7 figures, 16 tables)

Introduction
Related Work
Benchmarking Retrieval Datasets
Dataset Construction
Dataset Statistics and Analysis
Benchmarking Evaluation Protocol
Benchmarking Retrieval Models
Retrieval Models
Benchmarked Results
Further Analysis
Conclusion and Future Work
Dataset Details
Detailed Description of Datasets
NQ-UTD
MS MARCO
...and 9 more sections

Figures (7)

Figure 1: Ranking performance versus source bias comparison on averaged results of 16 datasets benchmarked in Cocktail. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content. The Pearson correlation coefficient between these two axes is $-0.772$ ($p$-value < 0.05), indicating a strong negative correlation. For brevity, we omit the '%' symbol of the scores in all the tables and figures.
Figure 2: An overview of the dataset construction pipeline involved in Cocktail.
Figure 3: Results of different pooling strategies. "w-mean" denotes weighted mean pooling. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content.
Figure 4: Comparison of different model sizes. A more negative Relative $\Delta$ signifies increased source bias towards LLM-generated content.
Figure 5: Distribution of text length of corpus for each dataset in Cocktail.
...and 2 more figures
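
The captions above rely on two quantities: a Relative $\Delta$ score per model and a Pearson correlation between ranking performance and that score. The sketch below illustrates how such quantities could be computed. It assumes that Relative $\Delta$ is the symmetric relative percentage difference between a ranking metric measured on human-written documents and the same metric measured on LLM-generated documents, which matches the sign convention in the captions (negative values mean the LLM-generated side scores higher). The exact definition is given in the paper's evaluation protocol, not on this page; the function names and the toy numbers are hypothetical and are not results from the paper.

```python
from math import sqrt


def relative_delta(metric_human: float, metric_llm: float) -> float:
    """Assumed form: symmetric relative difference, in percent.

    Negative values mean the metric is higher on LLM-generated documents,
    i.e. the model is biased toward LLM-generated content.
    """
    mean = (metric_human + metric_llm) / 2.0
    if mean == 0:
        return 0.0  # both metrics are zero, so there is no measurable bias
    return (metric_human - metric_llm) / mean * 100.0


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy, made-up scores for three hypothetical models (NOT from the paper).
performance = [30.0, 40.0, 50.0]   # overall ranking performance per model
metric_human = [15.0, 18.0, 20.0]  # metric on human-written documents
metric_llm = [15.0, 22.0, 30.0]    # metric on LLM-generated documents

deltas = [relative_delta(h, l) for h, l in zip(metric_human, metric_llm)]
correlation = pearson(performance, deltas)
print(deltas, correlation)
```

With inputs of this shape, a negative correlation means that models with higher ranking performance tend to have a more negative Relative $\Delta$, which is the pattern the Figure 1 caption describes.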
