Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue

TL;DR

This work tackles the cost and opacity of using enterprise LLMs to evaluate Retrieval Augmented Generation (RAG) outputs by proposing a lightweight framework based on quantized open-weight LLMs. It introduces a statement-level evaluation pipeline with a three-part architecture (simplifier, evaluator, parser) to assess correctness and faithfulness, and formalizes two parsing strategies (deterministic regex and constrained JSON-schema parsing) to derive robust metrics. Using Natural Questions and WikiEval datasets, the authors show that 4-bit Llama3 and 9B Gemma2 configurations can achieve performance close to GPT-3.5-Turbo on accuracy-related metrics, with deterministic parsing often offering better stability. The study also introduces an AUC-based alignment measure to relate the continuous evaluator scores to human judgments, highlighting practical implications for reproducible, transparent RAG evaluation while pointing to future work on combining evaluators to further reduce bias and improve agreement.
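The three-stage architecture can be pictured with a small sketch for the faithfulness case. This is only an illustration under stated assumptions: the function names, the `SUPPORTED`/`UNSUPPORTED` labels, and the ratio used as the score are hypothetical, and the two stand-in stages use plain string operations where the framework would prompt a quantized LLM.

```python
from typing import Callable, List

# Hypothetical type aliases for the first two pipeline stages.
Simplifier = Callable[[str], List[str]]
Evaluator = Callable[[List[str], str], List[str]]


def split_into_statements(answer: str) -> List[str]:
    """Stand-in simplifier: a real one would ask an LLM for atomic statements."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def label_by_substring(statements: List[str], context: str) -> List[str]:
    """Stand-in evaluator: a statement counts as supported only if it
    appears verbatim (case-insensitively) in the retrieved context."""
    ctx = context.lower()
    return ["SUPPORTED" if s.lower() in ctx else "UNSUPPORTED" for s in statements]


def faithfulness_score(
    answer: str,
    context: str,
    simplifier: Simplifier = split_into_statements,
    evaluator: Evaluator = label_by_substring,
) -> float:
    """Simplify, evaluate, then count labels into a continuous score in [0, 1].

    Assumption: the score is the fraction of supported statements.
    """
    statements = simplifier(answer)
    if not statements:
        return 0.0
    labels = evaluator(statements, context)
    supported = sum(1 for label in labels if label == "SUPPORTED")
    return supported / len(statements)


context = "Paris is the capital of France. It lies on the Seine."
answer = "Paris is the capital of France. It lies on the Thames."
score = faithfulness_score(answer, context)
```

Because the stand-in evaluator already returns structured labels, the parsing step here reduces to counting. With free-text LLM output, that step needs a dedicated parser.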

Abstract

Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. Although autoregressive models have improved notably, hallucination in generated answers remains a persistent problem. A standard solution is to use commercial LLMs, such as GPT-4, to evaluate these systems. However, such frameworks are expensive and not very transparent. We therefore present a study that demonstrates the value of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that assigns continuous scores to generated answers with respect to their correctness and faithfulness. These scores allow us to question the reliability of decisions and to explore thresholds, leading to a new AUC metric as an alternative to correlation with human judgment.
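One plausible reading of the AUC idea is the standard area under the ROC curve, obtained by sweeping a decision threshold over the continuous evaluator scores and comparing against binary human labels. The sketch below computes that quantity through its equivalent pairwise form. The function name and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], human_labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen human-approved answer (label 1)
    receives a higher evaluator score than a randomly chosen rejected one (label 0);
    ties count as one half.
    """
    positives = [s for s, y in zip(scores, human_labels) if y == 1]
    negatives = [s for s, y in zip(scores, human_labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: evaluator scores and human judgments (1 = correct, 0 = incorrect).
scores = [0.9, 0.8, 0.4, 0.6, 0.1]
human_labels = [1, 1, 1, 0, 0]
auc = auc_from_scores(scores, human_labels)
```

Unlike a correlation coefficient, this value does not depend on picking a single threshold: it summarizes how well the scores separate the two human-labelled groups across all thresholds.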

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Evaluation pipeline for answer correctness. First, the simplifier extracts the statements of the answer and the ones of the ground truth. Afterwards, the evaluator labels the statements according to the definitions. Finally, the parser extracts the labelled statements and calculates the metric.
  • Figure 2: Density plots of the scores from the correctness evaluators that use the second regular expression as parser. The score distributions for correct and incorrect answers are shown in blue and red, respectively. The labels follow the human annotations.
  • Figure 3: Density plots of the scores from the faithfulness evaluators that use the second regular expression as parser. The score distributions for faithful and unfaithful answers are shown in blue and red, respectively. The labels follow the human annotations.
  • Figure 4: The evaluation pipeline for faithfulness. First, the LLM extracts the statements of the answer. Afterwards, given the context, the statements are labelled according to the definition in the appendix on the faithfulness confusion matrix. Finally, the parser extracts the label matches and calculates the metric.
  • Figure 5: The parsing strategies for extracting the labelled statements. The deterministic parsing uses regular expressions to match the labels. The constrained generation parsing collects the labelled statements in a JSON schema and then returns the number of matches for each label.
  • ...and 1 more figure
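The two parsing strategies described for Figure 5 can be sketched as follows. Everything specific here is an assumption made for illustration: the label set (`TP`, `FP`, `FN`), the regular expression, the JSON layout, and the F1-style score are hypothetical and are not the paper's actual patterns or schema.

```python
import json
import re
from collections import Counter

LABELS = ("TP", "FP", "FN")  # assumed label set

# Assumed output format: one labelled statement per line, e.g. "TP: some statement".
LABEL_PATTERN = re.compile(r"^\s*(?:[-*]\s*)?\[?(TP|FP|FN)\]?\s*[:\-]", re.MULTILINE)


def parse_with_regex(evaluator_output: str) -> Counter:
    """Deterministic parsing: count label matches in free-text evaluator output."""
    return Counter(LABEL_PATTERN.findall(evaluator_output))


def parse_with_json(evaluator_output: str) -> Counter:
    """Constrained-generation parsing: the output is assumed to be JSON of the
    form {"TP": [...], "FP": [...], "FN": [...]}; count statements per label."""
    data = json.loads(evaluator_output)
    return Counter({label: len(data.get(label, [])) for label in LABELS})


def f1_from_counts(counts: Counter) -> float:
    """Assumed correctness score: F1 over the labelled statements."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


free_text = "TP: Paris is the capital of France.\n- FP: It has 10 million residents.\nTP: It lies on the Seine.\n[FN]: It hosts the Louvre."
as_json = '{"TP": ["a", "c"], "FP": ["b"], "FN": ["d"]}'

regex_counts = parse_with_regex(free_text)
json_counts = parse_with_json(as_json)
regex_score = f1_from_counts(regex_counts)
json_score = f1_from_counts(json_counts)
```

The regex route works on whatever text the evaluator emits but depends on the model following the line format. The JSON route shifts that burden to the decoding step, which has to guarantee schema-valid output.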