Table of Contents
Fetching ...

GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model

Xinzhe Li, Ming Liu, Shang Gao

TL;DR

This work introduces GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules.

Abstract

Retrieval-Augmented Generation (RAG) systems are widely used across various industries for querying closed-domain and in-house knowledge bases. However, evaluating these systems presents significant challenges due to the private nature of closed-domain data and a scarcity of queries with verifiable ground truths. Moreover, there is a lack of analytical methods to diagnose problematic modules and identify types of failure, such as those caused by knowledge deficits or issues with robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules. Our validation experiments reveal that GRAMMAR provides a reliable approach for identifying vulnerable modules and supports hypothesis testing for textual form vulnerabilities. An open-source tool accompanying this framework is available in our GitHub repository (see https://github.com/xinzhel/grammar), allowing for easy reproduction of our results and enabling reliable and modular evaluation in closed-domain settings.

GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model

TL;DR

This work introduces GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules.

Abstract

Retrieval-Augmented Generation (RAG) systems are widely used across various industries for querying closed-domain and in-house knowledge bases. However, evaluating these systems presents significant challenges due to the private nature of closed-domain data and a scarcity of queries with verifiable ground truths. Moreover, there is a lack of analytical methods to diagnose problematic modules and identify types of failure, such as those caused by knowledge deficits or issues with robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules. Our validation experiments reveal that GRAMMAR provides a reliable approach for identifying vulnerable modules and supports hypothesis testing for textual form vulnerabilities. An open-source tool accompanying this framework is available in our GitHub repository (see https://github.com/xinzhel/grammar), allowing for easy reproduction of our results and enabling reliable and modular evaluation in closed-domain settings.
Paper Structure (62 sections, 8 equations, 3 figures, 8 tables, 1 algorithm)

This paper contains 62 sections, 8 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: An Example of Applying the GRAMMAR Framework for Modular Evaluation and Hypothesis Testing. The upper section demonstrates the data generation process for creating sets of hypothetically robust ($\mathcal{D}{\text{robust}}$) and non-robust ($\mathcal{D}{\text{non-robust}}$) data. The lower section depicts the evaluation protocol that utilizes the generated data to identify defective modules and facilitate hypothesis testing.
  • Figure 2: Scaled Generation of Query-answer Pairs. Step 3 and step 4 are independent of each other and depend only on the SQL templates.
  • Figure 3: Entity-Relationship Diagrams for Data Generation