A System for Comprehensive Assessment of RAG Frameworks

Mattia Rengo, Senad Beadini, Domenico Alfano, Roberto Abbruzzese

TL;DR

SCARF addresses the need for holistic, end-to-end evaluation of deployed Retrieval-Augmented Generation systems by providing a modular, black-box framework that treats RAG platforms as interchangeable plugins. It enables single-framework and cross-framework benchmarking across diverse deployment configurations, vector stores, and LLM serving strategies, with per-question metrics and optional LLM-based evaluators such as EvaluatorGPT. The framework emphasizes adapters for external RAG endpoints, automated test orchestration, and comprehensive result reporting. Its acknowledged limitations include manual query selection, and it leaves room for richer metrics and UI enhancements. In practice, SCARF facilitates realistic, scalable comparisons of RAG deployments, supporting researchers and industry practitioners in selecting and tuning frameworks for specific use cases.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a standard paradigm for enhancing the factual accuracy and contextual relevance of Large Language Models (LLMs) by integrating retrieval mechanisms. However, existing evaluation frameworks fail to provide a holistic black-box approach to assessing RAG systems, especially in real-world deployment scenarios. To address this gap, we introduce SCARF (System for Comprehensive Assessment of RAG Frameworks), a modular and flexible evaluation framework designed to benchmark deployed RAG applications systematically. SCARF provides an end-to-end, black-box evaluation methodology, enabling low-effort comparison across diverse RAG frameworks. Our framework supports multiple deployment configurations and facilitates automated testing across vector databases and LLM serving strategies, producing a detailed performance report. Moreover, SCARF integrates practical considerations such as response coherence, providing a scalable and adaptable solution for researchers and industry professionals evaluating RAG applications. Using a REST API interface, we demonstrate how SCARF can be applied to real-world scenarios, showcasing its flexibility in assessing different RAG frameworks and configurations. SCARF is available in a GitHub repository.
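
The black-box, plugin-style evaluation described above can be illustrated with a minimal sketch. This is not SCARF's actual API: the names `RAGAdapter`, `StaticAdapter`, `exact_match`, and `run_benchmark` are hypothetical, and the exact-match metric is a deliberately simple stand-in for the per-question metrics and LLM-based evaluators the paper mentions. A real adapter would presumably call a deployed framework's REST endpoint instead of returning canned answers.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class RAGAdapter(Protocol):
    """Hypothetical plugin interface: one adapter per RAG framework under test."""

    name: str

    def query(self, question: str) -> str:
        ...


@dataclass
class StaticAdapter:
    """Stand-in adapter returning canned answers (assumption for illustration).

    A real adapter would send the question to the framework's REST endpoint.
    """

    name: str
    answers: dict[str, str]

    def query(self, question: str) -> str:
        return self.answers.get(question, "")


def exact_match(expected: str, actual: str) -> float:
    """Toy per-question metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_benchmark(
    adapters: list[RAGAdapter],
    dataset: list[tuple[str, str]],
    metric: Callable[[str, str], float] = exact_match,
) -> list[dict]:
    """Send every question to every adapter and collect one report row per pair."""
    report = []
    for adapter in adapters:
        for question, expected in dataset:
            # The framework is treated as a black box: only its answer is observed.
            answer = adapter.query(question)
            report.append(
                {
                    "framework": adapter.name,
                    "question": question,
                    "answer": answer,
                    "score": metric(expected, answer),
                }
            )
    return report


if __name__ == "__main__":
    adapters = [
        StaticAdapter("framework_a", {"What is RAG?": "Retrieval-Augmented Generation"}),
        StaticAdapter("framework_b", {}),
    ]
    dataset = [("What is RAG?", "retrieval-augmented generation")]
    for row in run_benchmark(adapters, dataset):
        print(row["framework"], row["score"])
```

Because the harness depends only on the `query` method, swapping one framework for another would mean writing a new adapter while leaving the dataset, metric, and reporting loop untouched, which is the kind of cross-framework comparison the abstract describes.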

Paper Structure

This paper contains 13 sections, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: High-level SCARF architecture showing modular integration points for RAG frameworks, vector databases, and LLM engines.
  • Figure 2: Flow showing how SCARF interacts with data and RAG frameworks to produce the output.