RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems
Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
TL;DR
RAG Playground addresses the need for principled evaluation of retrieval strategies and prompt engineering in Retrieval-Augmented Generation (RAG). It is an open-source framework that compares naive vector search, reranking, and hybrid retrieval, each combined with ReAct agents and structured prompting, across two LLMs (Llama 3.1 and Qwen 2.5). Evaluation uses a multi-metric framework that includes a novel Completeness Gain metric. In the reported experiments, hybrid retrieval consistently outperforms single-strategy approaches, Qwen 2.5 reaches a pass rate of up to 72.7% with higher numerical accuracy, and structured self-evaluation prompting delivers additional gains. The results suggest that retrieval strategy design and prompt engineering are cost-effective levers for building robust RAG systems on consumer hardware, with gains that can exceed those from moving to a larger model. The framework itself serves as a reusable, local evaluation platform.
Abstract
We present RAG Playground, an open-source framework for systematic evaluation of Retrieval-Augmented Generation (RAG) systems. The framework implements and compares three retrieval approaches (naive vector search, reranking, and hybrid vector-keyword search), each combined with ReAct agents using different prompting strategies. We introduce a multi-metric evaluation framework with novel metrics and provide empirical results comparing two language models (Llama 3.1 and Qwen 2.5) across these retrieval configurations. Our experiments demonstrate significant performance improvements from hybrid search and structured self-evaluation prompting, achieving a pass rate of up to 72.7% on our multi-metric evaluation framework. The results also highlight the importance of prompt engineering in RAG systems: our custom-prompted agents show consistent improvements in retrieval accuracy and response quality.
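To make the idea of hybrid vector-keyword search concrete, the following is a minimal sketch of one common way to merge a vector-search ranking with a keyword-search ranking: reciprocal rank fusion (RRF). The abstract does not state which fusion method RAG Playground uses, so RRF, the function name `reciprocal_rank_fusion`, the constant `k=60`, and the toy document ids are all illustrative assumptions rather than the framework's implementation.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[int]], k: int = 60) -> List[int]:
    """Fuse several best-first rankings of document ids into one ranking.

    Each document receives 1 / (k + rank) from every ranking it appears in
    (rank starts at 1), and documents are returned by descending total score.
    NOTE: RRF is an assumed fusion rule for illustration only; the source
    does not specify how the hybrid retriever combines its two result lists.
    """
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical rankings for three documents (ids 0, 1, 2), best first.
vector_ranking = [0, 1, 2]   # e.g. ordered by embedding similarity
keyword_ranking = [1, 2, 0]  # e.g. ordered by a keyword score such as BM25

hybrid_ranking = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(hybrid_ranking)
```

In this toy case, document 1 ranks near the top of both lists and so leads the fused ranking, while document 0, which only the vector ranking favors, falls behind it. A rank-based rule like this avoids having to put similarity scores and keyword scores on a common scale, which is one reason it is a frequent default for hybrid retrieval.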
