Can we Evaluate RAGs with Synthetic Data?
Jonas van Elburg, Peter van der Putten, Maarten Marx
TL;DR
The paper asks how Retrieval-Augmented Generation (RAG) systems can be evaluated when human-labeled benchmarks are unavailable, and explores synthetic QA benchmarks generated by large language models as a substitute. It compares how RAG variants are ranked under synthetic versus human benchmarks in two experiments, one varying the retriever configuration and one varying the generator architecture, using four datasets (two open-domain, two domain-specific). Synthetic benchmarks align well with human-labeled benchmarks when retriever parameters vary, but poorly when generator architectures are compared, which suggests task mismatch and stylistic bias as possible drivers of the misalignment. The study points to the practical potential of synthetic benchmarks for efficient, domain-specific RAG evaluation, particularly for retrieval tuning, while underscoring the need for careful task design, calibration, and further research to mitigate biases and improve reliability across generator architectures.
Abstract
We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter are unavailable. We assess the reliability of synthetic benchmarks in two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator while keeping retriever parameters fixed. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAG systems that differ in retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks and stylistic bias favoring certain generators.
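
To make the idea of "ranking agreement" concrete, the sketch below scores how closely two benchmarks order the same set of RAG variants. It uses Kendall's tau, which is an assumption made here for illustration: this section does not say which agreement measure the authors use. The system names and scores are hypothetical and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks.

    Each argument maps a system name to its score on one benchmark.
    Returns a value in [-1, 1]: 1 means identical rankings, -1 means
    fully reversed rankings. Tied pairs count as neither concordant
    nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both benchmarks must score the same systems")
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = 0
    discordant = 0
    for x, y in combinations(systems, 2):
        # Positive product: both benchmarks order the pair the same way.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four RAG variants (illustrative only).
human = {"rag_a": 0.62, "rag_b": 0.55, "rag_c": 0.48, "rag_d": 0.40}
synthetic = {"rag_a": 0.81, "rag_b": 0.77, "rag_c": 0.79, "rag_d": 0.60}

print(f"tau = {kendall_tau(human, synthetic):.3f}")
```

In this toy example the two benchmarks disagree on only one of six pairs (rag_b versus rag_c), so the agreement is high but not perfect. A value near zero or below would mean the synthetic benchmark is not a usable proxy for ranking those systems.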
