InspectorRAGet: An Introspection Platform for RAG Evaluation

Kshitij Fadnis, Siva Sankalp Patel, Odellia Boni, Yannis Katsis, Sara Rosenthal, Benjamin Sznajder, Marina Danilevsky

TL;DR

InspectorRAGet provides a publicly accessible, model- and metric-agnostic platform for end-to-end introspection of Retrieval Augmented Generation (RAG) outputs. By combining aggregate and instance-level analyses with both human and algorithmic metrics, plus annotator-quality views, it enables deeper causal and error analyses than standard leaderboard metrics allow. The paper demonstrates two use cases (RAG model performance and LLM-as-a-judge performance) and shows how the resulting insights can guide model improvement, data curation, and evaluation design. The work aims to advance reproducibility and interpretability in RAG evaluation and lays groundwork for extending introspection to other LLM-heavy tasks.

Abstract

Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. Despite increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic calculation. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is publicly available to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet.

Paper Structure (18 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: RAG evaluation life cycle. Evaluations of the RAG output are analyzed using InspectorRAGet.
  • Figure 2: Illustration of InspectorRAGet's core views and corresponding visualizations. Screenshots are drawn from the RAG model performance use case.