RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems

Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris

TL;DR

RAG Playground addresses the need for principled evaluation of retrieval strategies and prompt engineering in Retrieval-Augmented Generation (RAG). It introduces an open-source framework that compares naive vector search, reranking, and hybrid retrieval, coupled with ReAct agents and structured prompting, across two LLMs (Llama 3.1 and Qwen 2.5), and evaluates them with a multi-metric framework that includes a novel Completeness Gain metric. The experimental results show that hybrid retrieval consistently outperforms single-strategy approaches, with Qwen 2.5 achieving a pass rate of up to 72.7% and higher numerical accuracy, and with structured self-evaluation prompting delivering additional gains. The work has practical implications for building robust RAG systems on consumer hardware: retrieval strategy design and prompt engineering are cost-effective levers whose gains can surpass those from larger models, and the framework provides a reusable, local evaluation platform.

Abstract

We present RAG Playground, an open-source framework for systematic evaluation of Retrieval-Augmented Generation (RAG) systems. The framework implements and compares three retrieval approaches (naive vector search, reranking, and hybrid vector-keyword search), each combined with ReAct agents using different prompting strategies. We introduce a comprehensive evaluation framework with novel metrics and provide empirical results comparing two language models (Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our experiments demonstrate significant performance improvements through hybrid search methods and structured self-evaluation prompting, achieving a pass rate of up to 72.7% on our multi-metric evaluation framework. The results also highlight the importance of prompt engineering in RAG systems, with our custom-prompted agents showing consistent improvements in retrieval accuracy and response quality.
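To make the idea of hybrid vector-keyword search concrete, the following is a minimal sketch of one common way to merge two ranked result lists, reciprocal rank fusion. This is an assumption for illustration only: the abstract does not say how the framework combines vector and keyword results, and the function name, the constant `k=60`, and the document ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumed here,
    not taken from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a vector index and a keyword index.
vector_hits = ["d3", "d1", "d7"]
keyword_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents found by only one retriever, which is the intuition behind combining the two searches.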

Paper Structure

This paper contains 57 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Performance analysis across different configurations and metrics. Top: Overall performance metrics including mean scores and pass rates. Middle: Detailed comparison of mean scores across all evaluation metrics for different configurations. Bottom: Pass rates by metric showing the percentage of responses exceeding metric-specific thresholds. Results demonstrate consistent advantages of hybrid search and custom ReAct implementations, particularly with the Qwen model.
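The per-metric pass rate described in the caption (the share of responses whose score exceeds a metric-specific threshold) can be sketched as below. This is only an illustration under stated assumptions: the metric names, scores, and thresholds are placeholders and not values from the paper, and the strict comparison follows the caption's word "exceeding".

```python
def pass_rates(scores_by_metric, thresholds):
    """Return, per metric, the fraction of scores strictly above its threshold."""
    rates = {}
    for metric, scores in scores_by_metric.items():
        threshold = thresholds[metric]
        passed = sum(1 for s in scores if s > threshold)
        # An empty score list yields 0.0 rather than a division error.
        rates[metric] = passed / len(scores) if scores else 0.0
    return rates


# Placeholder data: two hypothetical metrics with their own thresholds.
example_scores = {"metric_a": [0.9, 0.6, 0.8, 0.4], "metric_b": [0.5, 0.7]}
example_thresholds = {"metric_a": 0.7, "metric_b": 0.6}
rates = pass_rates(example_scores, example_thresholds)
print(rates)
```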