ARAGOG: Advanced RAG Output Grading
Matouš Eibich, Shivay Nagpal, Alexander Fred-Ojala
TL;DR
ARAGOG conducts a broad, replication-friendly evaluation of Retrieval-Augmented Generation (RAG) techniques, focusing on retrieval precision and answer similarity. It analyzes seven techniques (Sentence Window Retrieval, Document Summary Index, HyDE, Multi-query, MMR, Cohere rerank, LLM rerank) and their combinations on a 423-document AI ArXiv corpus with 107 GPT-4–generated QA pairs, using 10 runs per setup. HyDE and LLM reranking consistently boost retrieval precision, and Sentence Window Retrieval achieves the highest retrieval precision; MMR and Cohere rerank offer limited benefits, and Multi-query underperforms. The work highlights practical trade-offs (latency, cost) and points to future directions such as knowledge-graph augmentation and Auto-RAG. It emphasizes reproducibility through the ARAGOG repository.
Abstract
Retrieval-Augmented Generation (RAG) is essential for integrating external knowledge into Large Language Model (LLM) outputs. While the literature on RAG is growing, it consists primarily of systematic reviews and comparisons of new state-of-the-art (SoTA) techniques against their predecessors, leaving a gap in extensive experimental comparisons. This study begins to address that gap by assessing the impact of various RAG methods on retrieval precision and answer similarity. We find that Hypothetical Document Embedding (HyDE) and LLM reranking significantly enhance retrieval precision. In contrast, Maximal Marginal Relevance (MMR) and Cohere rerank show no notable advantage over a baseline Naive RAG system, and Multi-query approaches underperform. Sentence Window Retrieval emerges as the most effective technique for retrieval precision, despite its variable performance on answer similarity. The study also confirms the potential of the Document Summary Index as a competent retrieval approach. All resources related to this research are publicly available in our GitHub repository ARAGOG (https://github.com/predlico/ARAGOG). We invite the community to build on this exploratory study of RAG systems.
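To make one of the techniques named above concrete, the following is a minimal sketch of Maximal Marginal Relevance (MMR) selection in its standard greedy form. It is not the implementation used in the ARAGOG experiments: the function names `cosine` and `mmr`, the parameter `lambda_mult`, and the toy two-dimensional vectors are assumptions chosen for illustration. Each step picks the candidate that best trades off relevance to the query against similarity to the documents already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of up to k documents.

    lambda_mult = 1.0 ranks purely by relevance to the query;
    lower values penalize redundancy with already-selected documents.
    (Hypothetical helper for illustration, not an ARAGOG API.)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any document chosen so far.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy example: documents 0 and 1 are near-duplicates; document 2 is less
# relevant to the query but covers a different direction.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
diverse = mmr(query, docs, k=2, lambda_mult=0.3)
```

In this toy setting, pure relevance ranking (`lambda_mult=1.0`) returns the two near-duplicate documents, while a lower `lambda_mult` replaces the second one with the more distinct document. The sketch shows only the selection mechanism and makes no claim about the conditions under which MMR improves retrieval precision.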
