InspectorRAGet: An Introspection Platform for RAG Evaluation
Kshitij Fadnis, Siva Sankalp Patel, Odellia Boni, Yannis Katsis, Sara Rosenthal, Benjamin Sznajder, Marina Danilevsky
TL;DR
InspectorRAGet provides a publicly accessible, model- and metric-agnostic platform for end-to-end introspection of Retrieval Augmented Generation (RAG) outputs. By integrating aggregate and instance-level analyses with both human and algorithmic metrics, plus annotator-quality views, it enables deeper causal and error analyses beyond standard leaderboard metrics. The paper demonstrates two use cases, RAG model performance and LLM-as-a-judge performance, and shows how each yields actionable insights for model improvement, data curation, and evaluation design. This work advances reproducibility and interpretability in RAG evaluation and lays groundwork for extending introspection to other LLM-heavy tasks.
Abstract
Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. Despite increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic metric calculation. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics, and to assess annotator quality. InspectorRAGet is suitable for multiple use cases and is publicly available to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet.
