Towards Lighter and Robust Evaluation for Retrieval Augmented Generation
Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue
TL;DR
This work tackles the cost and opacity of using enterprise LLMs to evaluate Retrieval Augmented Generation (RAG) outputs by proposing a lightweight framework based on quantized open-weight LLMs. It introduces a statement-level evaluation pipeline with a three-part architecture (simplifier, evaluator, parser) to assess correctness and faithfulness, and formalizes two parsing strategies (deterministic regex and constrained JSON-schema parsing) to derive robust metrics. Using Natural Questions and WikiEval datasets, the authors show that 4-bit Llama3 and 9B Gemma2 configurations can achieve performance close to GPT-3.5-Turbo on accuracy-related metrics, with deterministic parsing often offering better stability. The study also introduces an AUC-based alignment measure to relate the continuous evaluator scores to human judgments, highlighting practical implications for reproducible, transparent RAG evaluation while pointing to future work on combining evaluators to further reduce bias and improve agreement.
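The deterministic parsing strategy mentioned above can be illustrated with a minimal sketch. The verdict format (`<n>. <statement> -> PASS|FAIL`), the regex, and the function names `parse_verdicts` and `statement_score` are assumptions made for illustration only; this summary does not specify the authors' actual prompt or output format. The sketch shows how per-statement verdicts could be turned into a continuous score.

```python
import re

# Hypothetical verdict format (an assumption, not taken from the paper):
# one line per statement, e.g. "1. Paris is in France -> PASS".
VERDICT_RE = re.compile(
    r"^\s*\d+\.\s*(?P<statement>.+?)\s*->\s*(?P<verdict>PASS|FAIL)\s*$",
    re.IGNORECASE,
)


def parse_verdicts(evaluator_output: str) -> list:
    """Extract one boolean per statement line; lines that do not match are ignored."""
    verdicts = []
    for line in evaluator_output.splitlines():
        match = VERDICT_RE.match(line)
        if match:
            verdicts.append(match.group("verdict").upper() == "PASS")
    return verdicts


def statement_score(evaluator_output: str):
    """Return the fraction of statements judged PASS, or None if nothing was parsed."""
    verdicts = parse_verdicts(evaluator_output)
    if not verdicts:
        return None  # signal a parsing failure rather than guessing a score
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    raw = (
        "1. Paris is the capital of France -> PASS\n"
        "2. Paris has 10 million inhabitants -> FAIL\n"
        "Some free-form commentary from the model.\n"
        "3. France is in Europe -> pass\n"
    )
    print(parse_verdicts(raw))
    print(statement_score(raw))
```

Returning `None` when no verdict line is found keeps parsing failures visible instead of silently mapping them to a score of zero, which is one possible design choice among several.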
Abstract
Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the Retrieval Augmented Generation (RAG) framework. While autoregressive models have improved notably, hallucination in generated answers remains a persistent problem. A standard approach is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. We therefore propose a study that demonstrates the value of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that assigns continuous scores to generated answers with respect to their correctness and faithfulness. These scores allow us to question the reliability of decisions and to explore thresholds, leading to a new AUC metric as an alternative to correlation with human judgment.
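To make the idea of an AUC-based alternative to correlation concrete, the following is a minimal sketch of a standard ROC AUC computed between continuous evaluator scores and binary human judgments. It assumes human labels are encoded as 1 (acceptable) and 0 (not acceptable); the function name `roc_auc` and the example data are hypothetical, and the authors' exact construction of their AUC metric may differ.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    scores: continuous evaluator scores, higher meaning "more correct/faithful".
    labels: binary human judgments, 1 for positive and 0 for negative (assumed encoding).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Count how often a positive example outranks a negative one; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up illustrative data, not results from the paper.
    evaluator_scores = [0.9, 0.8, 0.4, 0.3]
    human_labels = [1, 0, 1, 0]
    print(roc_auc(evaluator_scores, human_labels))
```

Because AUC summarizes performance over all possible decision thresholds, it does not require choosing a single cut-off on the continuous score before comparing against human judgments.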
