EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Jan Strich, Adeline Scharfenberg, Chris Biemann, Martin Semmann

TL;DR

EncouRAGe introduces a modular Python framework for reproducible, local evaluation of Retrieval-Augmented Generation (RAG) pipelines, organized around five components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics. It formalizes RAG workflows with mathematical definitions and supports 10 RAG methods and over 20 metrics across generator, retrieval, and LLM-as-a-Judge evaluations, enabling systematic cross-method comparisons. Benchmarking on four diverse datasets (HotPotQA, FeTaQA, FinQA, BioASQ) shows that Oracle Context still outperforms RAG, while Hybrid BM25 consistently delivers the strongest performance among the tested RAG configurations; reranking offers limited gains at the cost of higher latency. Overall, EncouRAGe's local, extensible, open-source design aims to accelerate rigorous, domain-specific RAG research and practical deployment by enabling quick, reproducible benchmarking across datasets and models.

Abstract

We introduce EncouRAGe, a comprehensive Python framework designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. The framework emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. We further examine the effects of reranking, observing only marginal performance improvements accompanied by higher response latency.

Paper Structure

This paper contains 28 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: System overview of the EncouRAGe Python library. Input data in any format must be transformed to fit the Type Manifest. The RAG Factory provides various RAG methods, while the Metrics component implements all evaluation metrics. Inference can be executed locally or via cloud providers using the OpenAI SDK, and supported Vector Stores include Chroma and Qdrant. The output fits popular monitoring systems.
  • Figure 2: Type Manifest of EncouRAGe. Gold Documents link to the Context, which is combined with the prompt via a Jinja2 template to form the final Prompt Collection. Metadata ensures traceability for documents and prompts.
  • Figure 3: Overview of RAG Factory in EncouRAGe. RAG Factory is organized into three categories: Without RAG, Basic RAG, and Advanced. In total, EncouRAGe supports 10 methods, with more to be added in the future.
  • Figure 4: Comparison of percentage changes in generator metrics (F1/NM) and retrieval metrics (MRR/MAP) on a 2k-sample subset of each dataset (HotPotQA, FeTaQA, FinQA, and BioASQ). Reranking was performed using Jina v3 and Marco MiniLM-L6 v2. The x-axis represents the reranker ratio, and the y-axis shows the percentage change relative to the base RAG method.
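
The retrieval metrics named in the Figure 4 caption (MRR and MAP) and the percentage change relative to a base method can be illustrated with a minimal sketch. This uses the standard textbook definitions and is not EncouRAGe's own implementation; the function names, the toy document IDs, and the choice to normalize average precision by the number of gold documents are all assumptions made for illustration.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """Return 1/rank of the first gold document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """Average of precision@k at each rank k that holds a gold document.

    Assumption: normalized by the total number of gold documents.
    """
    if not gold_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_ids)


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def percentage_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base` (base must be non-zero)."""
    return 100.0 * (new - base) / base


# Hypothetical toy data: two queries, each with gold documents,
# a base ranking, and a ranking after reranking.
gold = [{"d2"}, {"d5", "d6"}]
base_runs = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
reranked_runs = [["d2", "d1", "d3"], ["d5", "d4", "d6"]]

base_mrr = mean([reciprocal_rank(r, g) for r, g in zip(base_runs, gold)])
rerank_mrr = mean([reciprocal_rank(r, g) for r, g in zip(reranked_runs, gold)])
base_map = mean([average_precision(r, g) for r, g in zip(base_runs, gold)])
rerank_map = mean([average_precision(r, g) for r, g in zip(reranked_runs, gold)])

print(f"MRR change: {percentage_change(rerank_mrr, base_mrr):+.1f}%")
print(f"MAP change: {percentage_change(rerank_map, base_map):+.1f}%")
```

In this toy setup, reranking moves the first gold document from rank 2 to rank 1 for both queries, so both metrics improve; the percentage-change helper corresponds to the quantity plotted on the figure's y-axis under the assumptions above.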