RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning

Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng

TL;DR

RAG-Zeval introduces an end-to-end, rule-guided evaluation framework for RAG outputs that uses reinforcement learning with a ranking objective to train compact LLM evaluators. By formulating faithfulness and correctness as claim-based judgments and generating evaluation trajectories in JSON, the approach achieves strong alignment with human judgments while reducing reliance on large-scale models. It uses Context-Aware Decoding to synthesize ranking references without human annotation and employs curriculum learning to progressively scale the ranking task. Experiments on faithfulness and correctness benchmarks show RAG-Zeval outperforms baselines built on much larger models and offers improved interpretability through its reasoning trajectories, highlighting the practicality of compact, reasoning-driven evaluators for scalable RAG evaluation.

Abstract

Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments with detailed explanations in one pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
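
To make the idea of a ranking-based outcome reward more concrete, the sketch below scores an evaluator by how well the order it induces over candidate responses agrees with a reference ranking. This is only an illustration under stated assumptions, not the reward defined in the paper: the function name `pairwise_ranking_reward`, the use of pairwise agreement, and the per-response scores (for example, the fraction of claims judged supported) are all hypothetical choices.

```python
from itertools import combinations


def pairwise_ranking_reward(predicted_scores, reference_ranking):
    """Hypothetical ranking reward: fraction of response pairs ordered correctly.

    predicted_scores: dict mapping response id -> evaluator score (higher is better),
        e.g. the share of a response's claims the evaluator judged supported.
    reference_ranking: list of response ids, best first (assumed to be given).
    """
    # Every pair (better, worse) implied by the reference ranking.
    pairs = list(combinations(reference_ranking, 2))
    if not pairs:
        return 0.0
    # A pair counts as correct only if the evaluator scores the better response strictly higher.
    agree = sum(
        1 for better, worse in pairs
        if predicted_scores[better] > predicted_scores[worse]
    )
    return agree / len(pairs)


# Usage: three candidate responses with a reference order a > b > c.
reward = pairwise_ranking_reward({"a": 0.9, "b": 0.1, "c": 0.5}, ["a", "b", "c"])
```

A reward of this shape depends only on relative order, which is the motivation the abstract gives for preferring preference judgments over absolute pointwise scores.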

Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: An overview of RAG-Zeval. We synthesize training data using Context-Aware Decoding. The complete prompt is presented in Fig. 5. The ground-truth ranking of $\bm{y}$'s depends on the value of $\alpha$.
  • Figure 2: The density distribution of the scores assigned by the faithfulness evaluators. The distributions of the faithful and unfaithful responses are marked with red and blue, respectively. TruLens, RAGAS and RAG-Checker are all implemented with Qwen2.5-72B-Instruct as the backbone LLM.
  • Figure 3: (a) shows the changes in the decomposed claim count, while (b) presents the evolution of the evidence extraction and supportiveness judgment abilities throughout the RL training process. The statistics are based on the rollout samples during training.
  • Figure 4: Reward dynamics of RL training with different data configurations. The red line represents the curriculum learning setting, while the green and blue lines are for static 3 and 4 responses, respectively.
  • Figure 5: The complete prompt used in training the evaluator. Given the current question, context, and $K$ candidate answers, the evaluator outputs a JSON-formatted string containing detailed evaluation for each candidate answer. Each evaluation follows the four key steps (highlighted in purple) to assess answer quality.
  • ...and 2 more figures