Table of Contents
Fetching ...

RAGProbe: An Automated Approach for Evaluating RAG Applications

Shangeetha Sivasothy, Scott Barnett, Stefanus Kurniawan, Zafaryab Rasool, Rajesh Vasa

TL;DR

An automated approach for continuously monitoring the health of RAG pipelines, which can be integrated into existing CI/CD pipelines, allowing for improved quality and outperforms the existing state-of-the-art methods.

Abstract

Retrieval Augmented Generation (RAG) is increasingly being used when building Generative AI applications. Evaluating these applications and RAG pipelines is mostly done manually, via a trial and error process. Automating evaluation of RAG pipelines requires overcoming challenges such as context misunderstanding, wrong format, incorrect specificity, and missing content. Prior works therefore focused on improving evaluation metrics as well as enhancing components within the pipeline using available question and answer datasets. However, they have not focused on 1) providing a schema for capturing different types of question-answer pairs or 2) creating a set of templates for generating question-answer pairs that can support automation of RAG pipeline evaluation. In this paper, we present a technique for generating variations in question-answer pairs to trigger failures in RAG pipelines. We validate 5 open-source RAG pipelines using 3 datasets. Our approach revealed the highest failure rates when prompts combine multiple questions: 91% for questions when spanning multiple documents and 78% for questions from a single document; indicating a need for developers to prioritise handling these combined questions. 60% failure rate was observed in academic domain dataset and 53% and 62% failure rates were observed in open-domain datasets. Our automated approach outperforms the existing state-of-the-art methods, by increasing the failure rate by 51% on average per dataset. Our work presents an automated approach for continuously monitoring the health of RAG pipelines, which can be integrated into existing CI/CD pipelines, allowing for improved quality.

RAGProbe: An Automated Approach for Evaluating RAG Applications

TL;DR

An automated approach for continuously monitoring the health of RAG pipelines, which can be integrated into existing CI/CD pipelines, allowing for improved quality and outperforms the existing state-of-the-art methods.

Abstract

Retrieval Augmented Generation (RAG) is increasingly being used when building Generative AI applications. Evaluating these applications and RAG pipelines is mostly done manually, via a trial and error process. Automating evaluation of RAG pipelines requires overcoming challenges such as context misunderstanding, wrong format, incorrect specificity, and missing content. Prior works therefore focused on improving evaluation metrics as well as enhancing components within the pipeline using available question and answer datasets. However, they have not focused on 1) providing a schema for capturing different types of question-answer pairs or 2) creating a set of templates for generating question-answer pairs that can support automation of RAG pipeline evaluation. In this paper, we present a technique for generating variations in question-answer pairs to trigger failures in RAG pipelines. We validate 5 open-source RAG pipelines using 3 datasets. Our approach revealed the highest failure rates when prompts combine multiple questions: 91% for questions when spanning multiple documents and 78% for questions from a single document; indicating a need for developers to prioritise handling these combined questions. 60% failure rate was observed in academic domain dataset and 53% and 62% failure rates were observed in open-domain datasets. Our automated approach outperforms the existing state-of-the-art methods, by increasing the failure rate by 51% on average per dataset. Our work presents an automated approach for continuously monitoring the health of RAG pipelines, which can be integrated into existing CI/CD pipelines, allowing for improved quality.
Paper Structure (28 sections, 5 figures, 9 tables)

This paper contains 28 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: RAGProbe: Our automated approach to generate question-answer pairs. Our approach is extensible by adding different evaluation scenarios and different evaluation metrics.
  • Figure 2: Total failure rate combining all 5 RAG pipelines. The failure rate is calculated as the number of failures divided by the total number of questions.
  • Figure 3: Comparison of failure rates across datasets. A higher failure rate is better as it reveals more limitations of the RAG pipelines.
  • Figure 4: Comparison of failure rates across RAG pipelines. A higher failure rate is better as it reveals more limitations of the RAG pipelines.
  • Figure 5: Automated evaluation results per dataset.