Towards Understanding Bias in Synthetic Data for Evaluation

Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, Bhaskar Mitra

TL;DR

The paper tackles biases introduced when using LLM-generated queries and relevance judgments to build synthetic test collections for IR evaluation. It analyzes a TREC-DL 2023–based dataset with human, GPT-4, and T5 queries and judgments, employing Bland-Altman analysis, KL divergence, and a linear mixed-effects model to quantify biases. Findings show that LLM judgments tend to be slightly more lenient, synthetic collections can overestimate absolute system performance, and bias tends to favor systems similar to the data-generating model, though relative rankings may retain some robustness. The work underscores the importance of human oversight and cross-validation when relying on synthetic data for evaluation, suggesting cautious use and further cross-domain studies.
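As a rough illustration of two of the tools named above, the sketch below computes a Bland-Altman bias with 95% limits of agreement for paired LLM and human labels, and the KL divergence between the two label distributions. The label values, the 0 to 3 scale, and the function names `bland_altman` and `kl_divergence` are assumptions made for this example; they are not taken from the paper's code or data.

```python
import numpy as np

def bland_altman(a, b):
    """Mean difference (bias) and 95% limits of agreement for paired values."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) in nats between two label-count histograms."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # eps guards against log(0) when a label never occurs
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical relevance labels on an assumed 0-3 scale (not real data)
llm_labels = [1, 2, 3, 0, 2, 1, 3, 2]
human_labels = [1, 1, 3, 0, 2, 0, 2, 2]

bias, lower, upper = bland_altman(llm_labels, human_labels)
kl = kl_divergence(np.bincount(llm_labels, minlength=4),
                   np.bincount(human_labels, minlength=4))
print(f"bias={bias:.3f}, limits=({lower:.3f}, {upper:.3f}), KL={kl:.4f}")
```

In this toy setup a positive bias means the LLM labels are, on average, higher than the human labels, which is the sense in which judgments are described as more lenient.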

Abstract

Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work \cite{rahmani2024synthetic} showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the effect of bias present in evaluation results obtained using synthetic test collections could be significant, e.g., when computing absolute system performance, its effect may not be as significant when comparing relative system performance. Code and data are available at: https://github.com/rahmanidashti/BiasSyntheticData.
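The distinction between absolute and relative system performance can be made concrete with a small sketch. The scores below are invented for illustration, and the use of Kendall's tau as the rank-agreement measure is an assumption of this example rather than something stated in the abstract. The synthetic scores are uniformly higher than the human-based ones, yet the ordering of systems is unchanged.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-system effectiveness scores (not taken from the paper)
human_scores = [0.42, 0.55, 0.38, 0.61, 0.47]      # human judgments
synthetic_scores = [0.50, 0.66, 0.49, 0.70, 0.53]  # synthetic judgments

# Absolute view: average amount by which synthetic scores exceed human ones
mean_gap = sum(s - h for s, h in zip(synthetic_scores, human_scores)) / len(human_scores)

# Relative view: do both collections rank the systems in the same order?
tau = kendall_tau(human_scores, synthetic_scores)
print(f"mean gap={mean_gap:.3f}, Kendall tau={tau:.2f}")
```

Here the mean gap is clearly positive while tau equals 1, so a collection could overestimate every system and still rank them identically. Real data would not be expected to behave this cleanly.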

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (a) The percentage of queries based on the number of words in the queries. Real queries are shorter than synthetic queries. (b) Bland-Altman plot to visualize the comparison between LLM and human expert judgments.
  • Figure 2: Distribution of relevance labels
  • Figure 3: Scatter plots of the effectiveness of TREC Deep Learning Track 2023 runs based on the generated synthetic evaluation test collection. Comparison of various human and synthetic configurations using NDCG@10 (top) and MAP (bottom).