Can we Evaluate RAGs with Synthetic Data?

Jonas van Elburg, Peter van der Putten, Maarten Marx

TL;DR

The paper addresses how to evaluate Retrieval-Augmented Generation (RAG) systems when human-labeled benchmarks are unavailable, by testing synthetic QA benchmarks generated by large language models as a substitute. It compares how RAG variants rank under synthetic versus human benchmarks in two experiments, one varying the retriever configuration and one varying the generator architecture, across four datasets (two open-domain, two domain-specific). Synthetic benchmarks align well with human benchmarks when retriever parameters vary but poorly when generator architectures are compared; the authors suggest task mismatch and stylistic bias as possible causes of the misalignment. The study points to the practical potential of synthetic benchmarks for efficient, domain-specific RAG evaluation, particularly for retrieval tuning, while underscoring the need for careful task design, calibration, and further research to mitigate biases and improve cross-architecture reliability.

Abstract

We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter are unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAGs that differ in retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks and stylistic bias favoring certain generators.

Paper Structure

This paper contains 20 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The average length of synthetic and human questions and reference answers of all datasets. Error bars indicate standard deviations.
  • Figure 2: Kendall’s $\tau$ rank correlation coefficients between human and synthetic benchmarks across four datasets and two experiments. Rank correlation is generally higher in the retriever experiment than the generator experiment. Furthermore, consistency is to an extent dataset- and metric-dependent.
  • Figure 3: Supervised metrics in the retrieval experiment. Error bars represent the variance over the 94-100 datapoints.
  • Figure 4: LLM-based metrics in the retrieval experiment. Error bars represent the variance over the 94-100 datapoints.
  • Figure 5: Supervised metrics in the LLM choice experiment. Error bars represent the variance over the 94-100 datapoints. The Llama model is missing from the Sales experiment because it was removed from the KB platform after the start of the research project.
  • ...and 1 more figures
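
Figure 2 reports Kendall's $\tau$ between the rankings that the human and synthetic benchmarks assign to the same set of RAG variants. The sketch below illustrates how such a rank correlation could be computed. It is not the authors' code: the function name, the configuration labels, and all scores are hypothetical, and it uses the tie-corrected $\tau_b$ variant, which may differ from the variant used in the paper.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists.

    Counts concordant and discordant pairs and corrects for ties.
    Returns a value in [-1, 1]; raises if either list is constant.
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_y_only) * (untied + tied_x_only))
    if denom == 0:
        raise ValueError("tau is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical mean metric scores for four retriever configurations
# (illustrative numbers only, not results from the paper).
configs = ["k=1", "k=3", "k=5", "k=10"]
human_scores = [0.52, 0.61, 0.66, 0.64]
synthetic_scores = [0.70, 0.78, 0.80, 0.83]

tau = kendall_tau_b(human_scores, synthetic_scores)
print(f"Kendall's tau-b over {len(configs)} configs: {tau:.3f}")
```

A value near 1 means the synthetic benchmark orders the RAG variants almost as the human benchmark does, while a value near 0 means the two orderings are essentially unrelated. In this toy example the benchmarks disagree on one of six configuration pairs.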