ARAGOG: Advanced RAG Output Grading

Matouš Eibich, Shivay Nagpal, Alexander Fred-Ojala

TL;DR

ARAGOG conducts a broad, replication-friendly evaluation of Retrieval-Augmented Generation (RAG) techniques, focusing on retrieval precision and answer similarity. It analyzes seven techniques (Sentence-window retrieval, Document Summary Index, HyDE, Multi-query, MMR, Cohere rerank, LLM rerank) and their combinations on a 423-document AI ArXiv corpus with 107 GPT-4–generated QA pairs, using 10 runs per setup. HyDE and LLM reranking consistently boost retrieval precision, while Sentence-window retrieval achieves the strongest precision; MMR and Cohere rerank offer limited benefits and Multi-query underperforms. The work highlights practical trade-offs (latency, cost) and points to future directions like knowledge-graph augmentation and Auto-RAG, with an emphasis on reproducibility via the ARAGOG repository.

Abstract

Retrieval-Augmented Generation (RAG) is essential for integrating external knowledge into Large Language Model (LLM) outputs. While the literature on RAG is growing, it primarily focuses on systematic reviews and comparisons of new state-of-the-art (SoTA) techniques against their predecessors, with a gap in extensive experimental comparisons. This study begins to address this gap by assessing various RAG methods' impacts on retrieval precision and answer similarity. We found that Hypothetical Document Embedding (HyDE) and LLM reranking significantly enhance retrieval precision. However, Maximal Marginal Relevance (MMR) and Cohere rerank did not exhibit notable advantages over a baseline Naive RAG system, and Multi-query approaches underperformed. Sentence Window Retrieval emerged as the most effective for retrieval precision, despite its variable performance on answer similarity. The study confirms the potential of the Document Summary Index as a competent retrieval approach. All resources related to this research are publicly accessible for further investigation through our GitHub repository ARAGOG (https://github.com/predlico/ARAGOG). We welcome the community to further this exploratory study in RAG systems.
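To make two of the terms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It uses the standard Maximal Marginal Relevance formulation, where each pick maximises `lam * relevance - (1 - lam) * redundancy`, over toy 2-D vectors. The `retrieval_precision` helper assumes a binary relevance judgment per retrieved chunk, which is an assumption made for illustration; the paper's exact scoring procedure is not described in this section. All function names and the `lam` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Pick k document indices by Maximal Marginal Relevance.

    lam=1.0 ranks purely by relevance to the query; lower values
    penalise documents similar to those already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to any already-selected doc.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def retrieval_precision(relevant_flags):
    """Fraction of retrieved chunks judged relevant (assumed binary labels)."""
    return sum(relevant_flags) / len(relevant_flags) if relevant_flags else 0.0


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
    print(mmr(query, docs, k=2, lam=1.0))  # relevance only
    print(mmr(query, docs, k=2, lam=0.3))  # diversity-weighted
    print(retrieval_precision([True, False, True, True]))
```

With `lam=1.0` the two near-duplicate documents are both selected; lowering `lam` trades some relevance for diversity, which is the behaviour MMR is designed to provide.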

Paper Structure

This paper contains 29 sections, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: A high-level overview of the workflow within a Retrieval-Augmented Generation (RAG) system. This process diagram shows how a user query is processed by the system to retrieve relevant documents from a database and how these documents inform the generation of a response.
  • Figure 2: The process flow of Hypothetical Document Embedding (HyDE) technique within a Retrieval-Augmented Generation system. The diagram illustrates the steps from the initial query input to the generation of a hypothetical answer and its use in retrieving relevant documents to inform the final generated response.
  • Figure 3: This diagram showcases how multiple similar queries are generated from an initial user query, and how they contribute to retrieving a wider range of relevant documents.
  • Figure 4: This flowchart outlines the reranking process in a RAG system. It illustrates how retrieved documents are further assessed for relevance using a reranking step, which refines the set of documents that will inform the generated response.
  • Figure 5: The visualization of the AI ArXiv dataset preparation process. This diagram shows the selection of papers for question-answer generation, the employment of the full dataset to provide ample noise for the RAG system, and the chunking approaches used to process the documents for the vector database.
  • ...and 3 more figures
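
The workflows captioned in Figures 2 to 4 (HyDE, multi-query, and reranking) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the ARAGOG code: `embed` is a toy bag-of-words stand-in for a real embedding model, and `generate_hypothetical_answer`, `generate_variants`, and `score_fn` are hypothetical callables standing in for LLM or reranker calls.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_text, chunks, k):
    """Naive retrieval: indices of the k chunks most similar to the query text."""
    qv = embed(query_text)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(qv, embed(chunks[i])),
        reverse=True,
    )
    return ranked[:k]


def hyde_retrieve(query, chunks, k, generate_hypothetical_answer):
    """HyDE: search with a generated hypothetical answer instead of the query."""
    hypothetical = generate_hypothetical_answer(query)  # assumed LLM call
    return retrieve(hypothetical, chunks, k)


def multi_query_retrieve(query, chunks, k, generate_variants):
    """Multi-query: retrieve for the query and its variants, then deduplicate."""
    merged = []
    for q in [query, *generate_variants(query)]:  # assumed LLM call
        for i in retrieve(q, chunks, k):
            if i not in merged:
                merged.append(i)
    return merged


def rerank(query, chunks, candidate_ids, score_fn, top_n):
    """Rerank: reorder retrieved candidates with a second relevance scorer."""
    ordered = sorted(
        candidate_ids,
        key=lambda i: score_fn(query, chunks[i]),  # assumed reranker call
        reverse=True,
    )
    return ordered[:top_n]
```

In a real system, the hypothetical callables would wrap a model API, and the reranker would see a larger candidate set than the final `top_n` passed to the generator.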