RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere

TL;DR

RAGalyst addresses the challenge of evaluating domain-specific, safety-critical RAG systems with an automated, human-aligned framework that combines synthetic QA data generation with LLM-based evaluation. Its three-module design (document preprocessing, agentic QA generation, and an LLM-guided evaluation module) produces high-quality QA datasets grounded in source documents and assesses RAG components with two LLM-as-a-Judge metrics, Answer Correctness and Answerability, refined through prompt optimization. Across military operations, cybersecurity, and bridge engineering, the framework reveals strong domain dependence in embedding choice, LLM performance, and retrieval depth, with no single configuration universally optimal. It also outperforms RAGAS on several criteria. By exposing domain-specific trade-offs and providing scalable benchmarking, RAGalyst enables practitioners to design more reliable, domain-aware RAG systems for high-stakes applications.

Abstract

Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.
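To make "correlation with human annotations" concrete, the sketch below shows one common way such alignment could be quantified: a Spearman rank correlation between judge scores and human scores. This is an illustration only. The abstract does not state which correlation statistic RAGalyst uses, and the function names and score values here are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five QA pairs (made-up values, not from the paper).
human_scores = [1.0, 0.0, 0.5, 1.0, 0.0]
judge_scores = [0.9, 0.2, 0.4, 0.8, 0.1]
alignment = spearman(human_scores, judge_scores)
```

A value near 1 would indicate that the judge orders answers much as the human annotators do; a prompt optimizer could then be steered by this kind of score, though the paper's actual optimization objective is not described in this section.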

Paper Structure

This paper contains 38 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of the RAGalyst framework, which consists of three modules: a pre-processing module that transforms domain-specific documents into text chunks, a QA generation pipeline for producing synthetic question-answer-context datasets, and an evaluation module for assessing RAG system performance.
  • Figure 2: We ablate both the MIPROv2-optimized Answer Correctness metric and the non-optimized Answerability metric using the LabeledFewShot optimizer. Our results show that Answer Correctness achieves its best performance with 8 examples, whereas LabeledFewShot optimization provides no improvement over our handcrafted prompt for Answerability.
  • Figure 3: We evaluate retrieval with the Recall@10 and MRR@10 metrics for a variety of embedding models across three different domains.
  • Figure 4: We evaluate LLM generation with the Answer Correctness, Faithfulness, and Answer Relevancy metrics on three different domains.
  • Figure 5: We ablate the number of chunks retrieved with Gemma3-4B to assess the effect on LLM generation performance, measured by Answer Correctness, Faithfulness, and Answer Relevancy. The figure shows that each metric responds differently to the number of chunks retrieved, and that the number of chunks that maximizes Answer Correctness varies.
  • ...and 1 more figure
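
The retrieval metrics named in Figure 3, Recall@10 and MRR@10, can be sketched as follows. This is a minimal illustration using the standard definitions of these metrics, not code from the RAGalyst repository; the function names, chunk IDs, and sample runs are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant chunks that appear among the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mrr_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, or 0.0 if none."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


Run = Tuple[Sequence[str], Set[str]]


def mean_over_queries(metric: Callable[..., float], runs: List[Run], k: int = 10) -> float:
    """Average a per-query metric over all (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical retrieval results: (ranked chunk IDs, ground-truth chunk IDs).
runs: List[Run] = [
    (["c3", "c7", "c1"], {"c1"}),  # relevant chunk found at rank 3
    (["c2", "c9", "c4"], {"c2"}),  # relevant chunk found at rank 1
    (["c5", "c6", "c8"], {"c0"}),  # relevant chunk not retrieved
]

mean_recall = mean_over_queries(recall_at_k, runs, k=10)
mean_mrr = mean_over_queries(mrr_at_k, runs, k=10)
```

With one ground-truth chunk per question, as would be natural for synthetic QA pairs generated from a single source chunk, Recall@k reduces to a hit rate, while MRR@k additionally rewards placing that chunk near the top of the ranking.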