GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model

Xinzhe Li; Ming Liu; Shang Gao

TL;DR

This work introduces GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules.

Abstract

Retrieval-Augmented Generation (RAG) systems are widely used across various industries for querying closed-domain and in-house knowledge bases. However, evaluating these systems presents significant challenges due to the private nature of closed-domain data and a scarcity of queries with verifiable ground truths. Moreover, there is a lack of analytical methods to diagnose problematic modules and identify types of failure, such as those caused by knowledge deficits or issues with robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules. Our validation experiments reveal that GRAMMAR provides a reliable approach for identifying vulnerable modules and supports hypothesis testing for textual form vulnerabilities. An open-source tool accompanying this framework is available in our GitHub repository (see https://github.com/xinzhel/grammar), allowing for easy reproduction of our results and enabling reliable and modular evaluation in closed-domain settings.

Paper Structure (62 sections, 8 equations, 3 figures, 8 tables, 1 algorithm)

Introduction
Background
Retrieval-augmented Generation (RAG)
Reference-Free Evaluation
Reference-Based Evaluation
Is Reference-free Evaluation Reliable?
Two Evaluation Perspectives: Optimism and Cynicism
Two Evaluation Protocols
Results: Both are Extremely Optimistic on Wrong Predictions While SelfCheck Becomes Too Cynical on Correct Predictions
GRAMMAR
Generating Query Templates
Database Schema
Generating SQL Templates
Generating Text Templates
From Templates To Evaluation Data
...and 47 more sections

Figures (3)

Figure 1: An Example of Applying the GRAMMAR Framework for Modular Evaluation and Hypothesis Testing. The upper section demonstrates the data generation process for creating sets of hypothetically robust ($\mathcal{D}_{\text{robust}}$) and non-robust ($\mathcal{D}_{\text{non-robust}}$) data. The lower section depicts the evaluation protocol that utilizes the generated data to identify defective modules and facilitate hypothesis testing.
Figure 2: Scaled Generation of Query-answer Pairs. Step 3 and step 4 are independent of each other and depend only on the SQL templates.
Figure 3: Entity-Relationship Diagrams for Data Generation
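
The template-based generation named in the section outline and in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that a SQL template is filled with entity values and executed against the database to obtain a ground-truth answer, while a paired text template is filled with the same values to obtain the natural-language query. The schema, rows, templates, and the function name `generate_pairs` are hypothetical and are not taken from the paper or its repository.

```python
import sqlite3

# Hypothetical toy schema and rows; not taken from the paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (name TEXT, lead TEXT)")
conn.executemany(
    "INSERT INTO project VALUES (?, ?)",
    [("Apollo", "Alice"), ("Hermes", "Bob")],
)

# One SQL template paired with one text template (both hypothetical).
SQL_TEMPLATE = "SELECT lead FROM project WHERE name = ?"
TEXT_TEMPLATE = "Who leads the {name} project?"


def generate_pairs(conn):
    """Fill both templates with each entity value and return (query, answer) pairs."""
    names = [row[0] for row in conn.execute("SELECT name FROM project ORDER BY name")]
    pairs = []
    for name in names:
        # The answer comes from executing the SQL, so it is grounded in the database.
        answer = conn.execute(SQL_TEMPLATE, (name,)).fetchone()[0]
        # The query text is produced independently from the text template.
        query = TEXT_TEMPLATE.format(name=name)
        pairs.append((query, answer))
    return pairs


pairs = generate_pairs(conn)
for query, answer in pairs:
    print(query, "->", answer)
```

Because each answer is read from the database rather than written by a model, every generated query in this sketch carries a verifiable ground truth, which is the property the abstract describes as grounded data generation.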