BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant

TL;DR

RAG benchmarks are currently fragmented, hindering fair comparison across approaches. BERGEN introduces an open-source, end-to-end benchmarking framework that standardizes data handling, indexing, retrieval, generation, and evaluation across diverse retrievers, rerankers, LLMs, and multilingual datasets, enabling 500+ experiments. The framework emphasizes semantic evaluation via LLMeval and conducts extensive analyses of metrics, datasets, retrieval impact, LLM size, fine-tuning, and multilingual RAG, revealing that retrieval quality substantially boosts generation and that semantic metrics capture knowledge-grounded performance better than surface metrics. BERGEN thus advances reproducibility and provides practical guidelines for robust RAG evaluation, with significant implications for both English and multilingual knowledge-intensive tasks.

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve a large number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at https://github.com/naver/bergen.
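
To make the distinction between surface and semantic metrics concrete, the following minimal sketch contrasts two common surface-level QA metrics: strict exact match and a more lenient containment match. It is an illustration under stated assumptions, not BERGEN's implementation: the function names (normalize, exact_match, contains_match), the normalization steps, and the example strings are all hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def contains_match(prediction: str, references: list[str]) -> bool:
    """True if any normalized reference appears as a substring of the prediction."""
    pred = normalize(prediction)
    return any(normalize(ref) in pred for ref in references)


# Hypothetical example: one short reference label and two LLM-style answers.
references = ["Paris"]
verbose = "The capital of France is Paris."
paraphrase = "It is the French capital, the City of Light."

for answer in (verbose, paraphrase):
    print(answer, exact_match(answer, references), contains_match(answer, references))
```

In this toy example the verbose answer fails exact match but passes containment match, while the paraphrase fails both even though a human or an LLM judge would likely accept it. That gap is the kind of case a semantic metric is meant to cover.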

Paper Structure (37 sections, 8 figures, 14 tables)

Figures (8)

  • Figure 2: Summary of features in BERGEN. BERGEN enables a reproducible and comprehensive study of state-of-the-art retrievers, rerankers and LLMs in RAG (we conduct 500+ experiments; see Table \ref{tab:main_table}).
  • Figure 3: Correlation of different metrics with GPT-4-as-a-judge for datasets with varying reference label lengths (short, medium, and long).
  • Figure 4: Performance gain with and without retrieval (SPLADE-v3 + reranking (RR) with DeBERTa-v3) on different datasets with SOLAR-10.7B.
  • Figure 5: Impact of retrieval performance on RAG performance for SOLAR-10.7B on NQ with different ranking systems. RR denotes additional reranking with DeBERTa-v3.
  • Figure 6: Performance gains with and without oracle retrieval for LLMs of different sizes, comparing closed-book generation against oracle passages, averaged over all QA datasets in KILT.
  • ...and 3 more figures