A Methodology for Evaluating RAG Systems: A Case Study On Configuration Dependency Validation

Sebastian Simon, Alina Mailach, Johannes Dorn, Norbert Siegmund

TL;DR

This paper proposes a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrates its applicability on a real-world software engineering research task: the validation of configuration dependencies across software technologies.

Abstract

Retrieval-augmented generation (RAG) is an umbrella term for different components, design decisions, and domain-specific adaptations that enhance the capabilities of large language models and counter their limitations regarding hallucination and outdated or missing knowledge. Since it is unclear which design decisions lead to satisfactory performance, developing RAG systems is often experimental and needs to follow a systematic methodology to obtain sound and reliable results. However, there is currently no generally accepted methodology for RAG evaluation despite a growing interest in this technology. In this paper, we propose a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrate its applicability on a real-world software engineering research task: the validation of configuration dependencies across software technologies. In summary, we make two novel contributions: (i) a reusable methodological design for evaluating RAG systems, including a demonstration that serves as a guideline, and (ii) a RAG system, developed following this methodology, that achieves the highest accuracy in the field of dependency validation. For the blueprint's demonstration, the key insights are the crucial role of choosing appropriate baselines and metrics, the necessity of systematic RAG refinements derived from qualitative failure analysis, and the reporting of key design decisions to foster replication and evaluation.

Paper Structure

This paper contains 42 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An exemplary cross-technology configuration dependency between Spring Boot and Docker, both specifying the port of the Web server.
  • Figure 2: Key considerations for sound empirical evaluations and refinements of RAG systems.
  • Figure 3: The prompt template and its components used for dependency validation.
  • Figure 4: Study design decisions based on the blueprint for evaluating RAG for configuration dependency validation.
  • Figure 5: Usage of context sources per context slot for all 500 dependencies.