The Chronicles of RAG: The Retriever, the Chunk and the Generator

Paulo Finardi; Leonardo Avila; Rodrigo Castaldoni; Pedro Gengo; Celio Larcher; Marcos Piau; Pablo Costa; Vinicius Caridá

TL;DR

This work presents good practices for implementing Retrieval Augmented Generation (RAG) in Brazilian Portuguese, proposing a simple inference pipeline and an evaluation framework. It compares sparse and dense retrievers, chunking strategies, and multi-stage architectures (including rerankers and hybrid search) using questions derived from the Brazilian Portuguese text of the first Harry Potter book, with answers generated by gpt-4, gpt-4-1106-preview, gpt-3.5-turbo-1106, and Gemini Pro. A key contribution is the Relative Maximum Score metric, which quantifies how close a given configuration is to a perfect RAG setup and so guides optimization. The study reports a 35.4% improvement in MRR@10 over the baseline retriever, a further 2.4% gain from tuning the input size, and a rise from a 57.88% baseline to a maximum relative score of 98.61%, along with a complete architecture and recommendations for RAG deployments in non-English contexts. The results point to retriever quality, efficient representation learning, and careful handling of prompts and data as the main levers for a robust RAG system with few hallucinations.

Abstract

Retrieval Augmented Generation (RAG) has become one of the most popular paradigms for giving LLMs access to external data, and it also serves as a grounding mechanism that mitigates hallucinations. Implementing RAG raises several challenges, such as effective integration of retrieval models, efficient representation learning, data diversity, computational efficiency, evaluation, and quality of text generation. New techniques to improve RAG appear every day, which makes it infeasible to experiment with every combination for a given problem. In this context, this paper presents good practices to implement, optimize, and evaluate RAG for the Brazilian Portuguese language, focusing on a simple pipeline for inference and experiments. We explored a diverse set of methods to answer questions about the first Harry Potter book. To generate the answers we used OpenAI's gpt-4, gpt-4-1106-preview, and gpt-3.5-turbo-1106, and Google's Gemini Pro. Focusing on the quality of the retriever, our approach improved MRR@10 by 35.4% compared to the baseline. When optimizing the input size in the application, we observed that it is possible to enhance it by a further 2.4%. Finally, we present the complete architecture of the RAG with our recommendations. As a result, we moved from a baseline of 57.88% to a maximum relative score of 98.61%.
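The retriever gains above are reported in MRR@10 (mean reciprocal rank, cut off at rank 10). As a minimal sketch of how that metric is usually computed, assuming each question has exactly one relevant chunk (consistent with the dataset description, where each answer is contained within a single chunk), the following may help. The function names and the chunk identifiers are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Sequence


def reciprocal_rank_at_k(
    ranked_ids: Sequence[Hashable], relevant_id: Hashable, k: int = 10
) -> float:
    """Return 1/rank of the relevant chunk if it is in the top k, else 0.0."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Average the reciprocal rank over all questions.

    rankings[i] is the retriever's ordered list of chunk ids for question i;
    relevant_ids[i] is the single chunk assumed to contain its answer.
    """
    if len(rankings) != len(relevant_ids):
        raise ValueError("rankings and relevant_ids must have the same length")
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevant_ids)
    )
    return total / len(rankings)
```

For instance, if the correct chunk is ranked first for one question, second for another, and is missing from the top 10 for a third, the MRR@10 is (1 + 0.5 + 0) / 3 = 0.5.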

Paper Structure (23 sections, 2 equations, 15 figures, 9 tables)

Introduction
Data Preparation
How to Evaluate
Relative Maximum Score
Introductory Experiments
Baseline: no context
Long Context
RAG Naive
Advanced Experiments
Retrievers
BM25
ADA-002
Custom ADA-002
Hybrid Search
Reranker
...and 8 more sections

Figures (15)

Figure 1: From a large document (book), chunks were created, and for each chunk, a question and an answer were generated using gpt-4, where the answer is contained within the chunk.
Figure 2: Performance of gpt-4-1106-preview on the Harry Potter dataset, x-axis: spaced at every $1,000$ tokens of input from the document, y-axis: represents the depth at which the answer is located in the document. The greener the better. Image based on Gregory repository greg_test_long_contex.
Figure 3: Average performance analysis of gpt-4-1106-preview using 128k tokens context per answer depth.
Figure 4: 1. Pass the query to the embedding model to represent its semantics as an embedded query vector; 2. Transfer the embedded query vector to the vector database or sparse index (BM25); 3. Fetch the top-k relevant chunks, as determined by the retriever algorithm; 4. Forward the query text and the retrieved chunks to the Large Language Model (LLM); 5. Use the LLM to produce a response based on the prompt filled with the retrieved content.
Figure 5: Bi-Encoder Architecture
...and 10 more figures
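The retrieve-then-generate flow described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it covers just the sparse (BM25) branch, so the query embedding step is skipped, and it uses a small from-scratch BM25 scorer. The class and function names, the prompt wording, the `k1` and `b` values (common defaults), and the `llm` callable standing in for any text generation client are all assumptions.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (Unicode-aware for str)."""
    return re.findall(r"\w+", text.lower())


class BM25Index:
    """Tiny in-memory BM25 index over text chunks (illustrative only)."""

    def __init__(self, chunks: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.chunks = list(chunks)
        self.k1, self.b = k1, b
        self.tokens = [tokenize(c) for c in self.chunks]
        self.tf = [Counter(t) for t in self.tokens]
        n = len(self.chunks)
        self.avg_len = sum(len(t) for t in self.tokens) / max(n, 1)
        # Document frequency: number of chunks containing each term.
        df = Counter(term for t in self.tokens for term in set(t))
        # Smoothed IDF variant that stays positive for very common terms.
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query: str, i: int) -> float:
        length = len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 3) -> List[str]:
        """Step 3 of the caption: fetch the k highest-scoring chunks."""
        order = sorted(
            range(len(self.chunks)), key=lambda i: self.score(query, i), reverse=True
        )
        return [self.chunks[i] for i in order[:k]]


def build_prompt(query: str, chunks: Sequence[str]) -> str:
    """Step 4: fill a (hypothetical) prompt template with the retrieved chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query: str, index: BM25Index, llm: Callable[[str], str], k: int = 3) -> str:
    """Steps 2 to 5: retrieve, build the prompt, and call the generator."""
    return llm(build_prompt(query, index.top_k(query, k)))
```

Passing the generator as a plain callable keeps the sketch independent of any particular provider; a dense retriever would replace `BM25Index.top_k` with a nearest-neighbor lookup over embedded chunks while leaving `build_prompt` and `answer` unchanged.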
