Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Zackary Rackauckas, Arthur Câmara, Jakub Zavrel

TL;DR

This work introduces RAGElo, an Elo-based evaluation toolkit for comparing Retrieval-Augmented Generation (RAG) pipelines in enterprise question answering. It combines synthetic query generation, LLM-as-a-judge assessments, and pairwise tournament-style comparisons to rank RAG variants without gold-standard answers. The study shows that RAG-Fusion (RAGF) generally yields more complete (though not universally more precise) answers than traditional RAG, with moderate alignment to human expert judgments. BM25-based retrieval remains a strong baseline, and the framework demonstrates scalable, automated evaluation that can extend to other domain-specific RAG systems and future improvements in prompts and embeddings.

Abstract

Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold-standard benchmarks for company-internal tasks. This makes it difficult to evaluate RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of RAG agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
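
To make the Elo-based competition mentioned in the abstract concrete, the following is a minimal sketch of how pairwise judgments could be turned into agent ratings. It uses the textbook Elo update; the K-factor of 32, the starting rating of 1000, the 400-point scale, and the example outcomes are assumptions chosen for illustration, not settings or results reported for RAGElo.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, games, k=32.0):
    """Apply pairwise outcomes in order and return the updated ratings.

    games: iterable of (agent_a, agent_b, score_a), where score_a is
    1.0 if A's answer was preferred, 0.0 if B's was, and 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    ratings = dict(ratings)  # leave the caller's dict untouched
    for agent_a, agent_b, score_a in games:
        exp_a = expected_score(ratings[agent_a], ratings[agent_b])
        delta = k * (score_a - exp_a)
        ratings[agent_a] += delta
        ratings[agent_b] -= delta  # zero-sum: B loses what A gains
    return ratings


# Hypothetical judge outcomes, invented purely for illustration.
start = {"RAGF": 1000.0, "RAG": 1000.0}
games = [("RAGF", "RAG", 1.0), ("RAGF", "RAG", 0.5), ("RAG", "RAGF", 0.0)]
final = update_elo(start, games)
```

Because each update is zero-sum, the total rating across agents stays constant. The result of a sequential update depends on the order of the games, so a practical tournament would typically shuffle the games and average over several runs.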

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A traditional RAG pipeline compared to a RAGF pipeline. While a traditional RAG agent submits only the original query to the search system, a RAGF agent first generates variations of the user query and combines the rankings induced by these queries into a final ranking using RRF. The resulting top-k passages are fed into the LLM for generating the answer to the user's query.
  • Figure 2: Process for creating synthetic queries. We prompt multiple LLMs to generate queries based on existing documents. We include some existing user queries in the prompt as few-shot examples.
  • Figure 3: The RAGElo evaluation pipeline. First, documents retrieved by the agents are evaluated pointwise according to their relevance to the user's question. Then, the agents' answers are evaluated pairwise, using the retrieved relevant documents from both agents as reference.
  • Figure 4: Bland-Altman plot to visualize the comparison between LLM-as-a-judge and expert answers.
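
The fusion step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion (RRF) over the rankings induced by several query variations, not the authors' implementation. The smoothing constant k=60 is the value commonly used with RRF and is an assumption here, as are the function name, the top_k default, and the document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant,
    one list per query variation.
    k: smoothing constant (60 is a common default; assumed here).
    top_k: number of fused results to return.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_k]


# Hypothetical rankings for the original query and two generated variations.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
top_passages = reciprocal_rank_fusion(rankings, top_k=3)
```

Documents that appear near the top of several lists accumulate the highest fused score, which is why a passage retrieved by multiple query variations can outrank one that is ranked first for only a single variation.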