RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning
Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
TL;DR
RAG-Zeval introduces an end-to-end, rule-guided evaluation framework for RAG outputs that uses reinforcement learning with a ranking objective to train compact LLM evaluators. By formulating faithfulness and correctness as claim-based judgments and generating evaluation trajectories in JSON, the approach achieves strong alignment with human judgments while reducing reliance on large-scale models. It uses Context-Aware Decoding to synthesize ranking references without human annotation and employs curriculum learning to progressively scale the ranking task. Experiments on faithfulness and correctness benchmarks show RAG-Zeval outperforms baselines built on much larger models and offers improved interpretability through its reasoning trajectories, highlighting the practicality of compact, reasoning-driven evaluators for scalable RAG evaluation.
Abstract
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments with detailed explanations in a single pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10 to 100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
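The ranking-based outcome reward can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything here is an assumption made for illustration: the function names `faithfulness_score` and `ranking_reward` are hypothetical, each response is assumed to be scored by the fraction of its claims judged as supported, and the reward is assumed to be a Kendall-tau-style pairwise agreement between the evaluator's scores and a synthesized reference ranking.

```python
from itertools import combinations


def faithfulness_score(judgments):
    """Fraction of claims judged as supported (assumed scoring rule).

    `judgments` is a list of booleans, one per claim in a response.
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


def ranking_reward(predicted_scores, reference_ranking):
    """Pairwise agreement in [-1, 1] between scores and a reference ranking.

    `predicted_scores` maps response id -> evaluator score.
    `reference_ranking` lists response ids from best to worst.
    This is a hypothetical reward, not the formula used in the paper.
    """
    pairs = list(combinations(reference_ranking, 2))
    if not pairs:
        return 0.0
    agreement = 0
    for better, worse in pairs:
        diff = predicted_scores[better] - predicted_scores[worse]
        if diff > 0:
            agreement += 1  # evaluator agrees with the reference order
        elif diff < 0:
            agreement -= 1  # evaluator reverses the reference order
        # ties contribute nothing
    return agreement / len(pairs)


# Three candidate responses with per-claim judgments from the evaluator.
scores = {
    "a": faithfulness_score([True, True, True]),
    "b": faithfulness_score([True, False]),
    "c": faithfulness_score([False, False]),
}
reward = ranking_reward(scores, ["a", "b", "c"])
```

Under these assumptions the reward depends only on the relative order of the responses, so the evaluator is never asked to match an exact pointwise score, which is the motivation the abstract gives for using preference judgments.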
