GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud

TL;DR

This work tackles the challenge of evaluating grounded QA in Retrieval-Augmented Generation by showing that LLM judges can miss important failure modes and that correlation with GPT-4 judgments is not a reliable proxy for practical performance. It introduces GroUSE, a 144-unit-test meta-evaluation benchmark that probes calibration and failure-mode discrimination across 16 scenarios, supplemented by a streamlined pipeline and targeted prompts. The study finds that closed models often outperform open ones on GroUSE, but that finetuning on GPT-4 evaluation traces dramatically improves open-model performance, bringing them closer to GPT-4 and surpassing prior open evaluators on many tests. The results argue for unit-test–driven evaluation to complement correlation-based metrics and demonstrate a practical approach to strengthening automated RAG evaluation tools, with clear implications for safer, more reliable grounded QA systems.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.

Paper Structure (45 sections, 15 figures, 9 tables)


Figures (15)

  • Figure 1: Simplified extract of four unit tests, all sharing the same question but testing different failure modes through slight variations in the answer and references. The typology of all 16 test types is detailed in the annex on unit test characteristics.
  • Figure 2: Metrics and their applicable situations. Answer relevancy is defined only when the answer includes a response. Completeness is evaluated only when the references actually contain an answer to the question. Faithfulness is assessed whenever the answer includes any information (direct response or related information).
  • Figure 3: GroUSE unit-testing of existing solutions for automatic grounded question answering evaluation
  • Figure 4: Characteristics of the 16 test types. Types 1 to 7 do not correspond to any failure mode; they test, in various situations, the model's ability to correctly evaluate answers that deserve the highest scores.
  • Figure 5: Evaluation pipeline. Each green square represents a call to an LLM, while the blue dotted square denotes a straightforward computation based on the call's results. The Usefulness and Faithfulness evaluations may be omitted if preceding calls suggest these metrics are not applicable.
  • ...and 10 more figures
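
The applicability rules stated in the Figure 2 caption can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `applicable_metrics`, its three boolean parameters, and the metric label strings are hypothetical names introduced here.

```python
from __future__ import annotations


def applicable_metrics(
    answer_has_response: bool,
    answer_has_related_info: bool,
    references_contain_answer: bool,
) -> set[str]:
    """Return the metrics that are defined for one (answer, references) pair.

    Hypothetical helper mirroring the Figure 2 caption:
    - answer relevancy: only when the answer includes a response
    - completeness: only when the references contain an answer
    - faithfulness: whenever the answer includes any information
      (a direct response or related information)
    """
    metrics: set[str] = set()
    if answer_has_response:
        metrics.add("answer_relevancy")
    if references_contain_answer:
        metrics.add("completeness")
    if answer_has_response or answer_has_related_info:
        metrics.add("faithfulness")
    return metrics


# Example: the answer only gives related information, and the references
# do not contain an answer to the question.
print(applicable_metrics(False, True, False))
```

Under this reading, an answer that contains no information at all, evaluated against references that hold no answer, leaves none of the three metrics defined.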