The Viability of Crowdsourcing for RAG Evaluation

Lukas Gienapp, Tim Hagen, Maik Fröbe, Matthias Hagen, Benno Stein, Martin Potthast, Harrisen Scells

TL;DR

This study assesses the viability of crowdsourcing for evaluating retrieval-augmented generation (RAG) by comparing human-written and LLM-generated responses and judgments. It introduces CrowdRAG-25, comprising 903 human-written and 903 LLM-generated RAG responses across 301 topics and three discourse styles, plus 47,320 crowd and 10,556 LLM pairwise judgments across seven utility dimensions for a selection of 65 topics. The findings show that human pairwise judgments, especially from competency-filtered workers, are reliable and cost-effective, while LLM judgments do not consistently match the human ground truth. Reference-based evaluation fails to robustly differentiate RAG systems, whereas judgment-based crowdsourced evaluation is feasible and informative; using LLMs as universal judges, however, yields inconsistent results. Overall, the work supports judgment-based crowdsourced evaluation for RAG, highlights bullet-style responses as particularly favorable, and provides an open dataset and tools for future RAG research.

Abstract

How good are humans at writing and judging responses in retrieval-augmented generation (RAG) scenarios? To answer this question, we investigate the efficacy of crowdsourcing for RAG through two complementary studies: response writing and response utility judgment. We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track, across the three discourse styles 'bulleted list', 'essay', and 'news'. For a selection of 65 topics, the corpus further contains 47,320 pairwise human judgments and 10,556 pairwise LLM judgments across seven utility dimensions (e.g., coverage and coherence). Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation. Human pairwise judgments provide reliable and cost-effective results compared to LLM-based pairwise or human/LLM-based pointwise judgments, as well as automated comparisons with human-written reference responses. All our data and tools are freely available.

Paper Structure

This paper contains 45 sections, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Distribution of space-separated words, citation-separated statements, and unique cited documents, per type and origin (LLM, Human).
  • Figure 2: Left: probability of a document being cited by rank position; Right: median rank of cited and uncited documents, per response style and origin (LLM, Human).
  • Figure 3: Cumulative proportion of statement/reference pairs per sentence-BLEU score.
  • Figure 4: Relative density of citations over normalized text position per style and origin (LLM, Human).
  • Figure 5: Distribution of Flesch reading ease index scores for responses and their cited documents, per origin.