SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Hossein A. Rahmani, Xi Wang, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas

TL;DR

The paper addresses the shortage of large-scale ad-hoc passage retrieval datasets by introducing SynDL, a test collection built from TREC Deep Learning track data with LLM-generated relevance judgments. A three-stage pipeline (Initial Query Assemble, Assessment Pool Generation, Automatic Judgment with LLM) produces 1,988 queries and 637,063 query–passage labels, and the collection is validated against human judgments through system-ranking correlations. Key findings are high agreement between SynDL-derived and human-derived system rankings (Kendall's tau around 0.83–0.86 for NDCG@10 and NDCG@100) and evidence that the synthetic judgments are not systematically biased toward GPT-based systems. The work enables scalable IR evaluation, supports richer baselines and analysis of synthetic versus human queries, and offers pathways for transfer learning and re-evaluation of existing passage retrieval approaches.

Abstract

Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and the research into publicly available datasets, existing information retrieval studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments, a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the missing large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional language model synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from the previous years of tracks. We compare system evaluation with human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.
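
The system-ranking comparison mentioned above can be illustrated with a minimal sketch of Kendall's tau, computed over per-system effectiveness scores obtained under two sets of relevance labels. This is not the authors' evaluation code: the function name `kendall_tau`, the system names, and the NDCG@10 values are all hypothetical, and the variant shown is the simple tau-a, which counts tied pairs as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts.

    Hypothetical helper for illustration only. Only systems present
    in both dicts are compared.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    n_pairs = len(systems) * (len(systems) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two systems in common")

    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both label sets order it the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means a tie in at least one ranking; tau-a ignores it.

    return (concordant - discordant) / n_pairs


# Hypothetical mean NDCG@10 per system (made-up numbers, not from the paper).
human_labels = {"bm25": 0.50, "dense": 0.62, "hybrid": 0.68, "rerank": 0.71}
synthetic_labels = {"bm25": 0.55, "dense": 0.70, "hybrid": 0.76, "rerank": 0.74}

tau = kendall_tau(human_labels, synthetic_labels)
print(f"Kendall's tau: {tau:.3f}")
```

In this toy example only one of the six system pairs (hybrid vs. rerank) swaps order between the two label sets, so the absolute scores differ while the ranking stays largely the same. A tau close to 1 is what "highly correlated system rankings" refers to.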

Paper Structure (7 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: System Ranking correlation test between two test collections (DL-19 and SynDL).
  • Figure 2: Scatter plots of the effectiveness of DL-23 runs based on SynDL synthetic queries vs. DL-23 test collection to analyse the bias towards systems using the same language model as the one used in synthetic query construction.
  • Figure 3: System Ranking correlation test between DL-2019 and SynDL.
  • Figure 4: System Ranking correlation test between DL-2020 and SynDL.
  • Figure 5: System Ranking correlation test between DL-2021 and SynDL.
  • ...and 3 more figures