Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting

TL;DR

This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.

Abstract

We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects into a binary score, indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find a substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
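
To make the grade-to-decision mapping and the agreement measurement concrete, here is a minimal Python sketch. The aspect names (correctness, completeness, honesty) come from the abstract; the 0-2 scale, the acceptance rule, and all function names are illustrative assumptions, not the paper's actual vRAG-Eval rubric.

```python
from dataclasses import dataclass

@dataclass
class Grade:
    """Per-aspect grades for one answer (0-2 scale is an assumption)."""
    correctness: int   # 0 = wrong, 1 = partially correct, 2 = correct
    completeness: int  # how fully the answer addresses the question
    honesty: int       # admits uncertainty rather than hallucinating

def to_binary(grade: Grade) -> str:
    """Collapse per-aspect grades into a thumbs-up/thumbs-down decision.

    Hypothetical rule: accept only if the answer scores at least 1
    on every aspect.
    """
    aspects = (grade.correctness, grade.completeness, grade.honesty)
    return "accept" if all(a >= 1 for a in aspects) else "reject"

def agreement_rate(llm: list[str], human: list[str]) -> float:
    """Fraction of answers where the LLM judge and the human expert agree."""
    return sum(a == b for a, b in zip(llm, human)) / len(human)

if __name__ == "__main__":
    grades = [Grade(2, 2, 2), Grade(1, 0, 2), Grade(2, 1, 1)]
    llm_decisions = [to_binary(g) for g in grades]
    human_decisions = ["accept", "reject", "accept"]
    print(llm_decisions)                                 # ['accept', 'reject', 'accept']
    print(agreement_rate(llm_decisions, human_decisions))  # 1.0
```

Under this reading, the paper's headline 83% figure is simply the agreement rate between the LLM judge's binary decisions and the human experts' decisions over the evaluated question set.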

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Distribution of human grading scores for tRAG's answers to the first 52 questions.
  • Figure 2: Distribution of human grading scores for tRAG's answers to the next 103 questions.
  • Figure 3: Score distribution comparison at each quality level.
  • ...and 4 more figures