Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

Mukul Chhabra, Luigi Medrano, Arush Verma

TL;DR

This work presents a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems, showing that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.

Abstract

Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.
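The abstract's "strict JSON outputs" can be illustrated with a minimal sketch of a validator that accepts a judge response only if it parses and matches an expected schema. The metric names, the flat name-to-score layout, and the 0 to 1 score range below are assumptions made for illustration. The abstract names five evaluation aspects but does not list the eight metrics or their scale.

```python
import json

# Hypothetical metric keys, taken from the five aspects named in the abstract.
# The paper's actual eight metric names are not given in this excerpt.
EXPECTED_METRICS = {
    "retrieval_quality",
    "grounding_fidelity",
    "answer_utility",
    "precision_integrity",
    "case_workflow_alignment",
}


def parse_judge_output(raw: str, low: float = 0.0, high: float = 1.0) -> dict[str, float]:
    """Parse one judge response and reject anything that is not strictly valid.

    Assumes the judge returns a flat JSON object mapping metric name to a
    numeric score in [low, high]; both the layout and the range are assumptions.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    if set(data) != EXPECTED_METRICS:
        raise ValueError(f"unexpected or missing metrics: {sorted(set(data) ^ EXPECTED_METRICS)}")
    for name, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {name!r} is not numeric")
        if not low <= value <= high:
            raise ValueError(f"score for {name!r} is out of range")
    return {name: float(value) for name, value in data.items()}
```

Rejecting malformed responses outright, rather than repairing them, keeps batch runs comparable across model versions, which is one plausible reading of why strict outputs help regression testing.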

Paper Structure (90 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: End-to-end evaluation pipeline for case-aware LLM-as-a-Judge scoring of multi-turn RAG responses.
  • Figure 2: Example metric weights for $S_{\text{final}}$.
  • Figure 3: Correlation heatmap across evaluation metrics.
  • Figure 4: Score distributions per metric.
  • Figure 5: Word count distribution of LLM judge justifications per metric.
  • ...and 2 more figures