Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu

TL;DR

This work presents ConsJudge, a self-improving framework that enhances LLM-based judgments for evaluating and optimizing Retrieval-Augmented Generation (RAG) systems. By prompting judgments across multiple dimensions (hallucination, completeness, coherence, semantic consistency) and enforcing judge-consistency through a direct preference optimization (DPO) loop, ConsJudge yields higher-quality evaluations that align more closely with superior LLMs and human judgments. The approach improves RAG training outcomes across diverse datasets and backbones, demonstrating the value of a consistency-driven, multi-dimensional judgment model in reducing evaluation bias and guiding more effective retrieval-augmented generation. The method offers a practical, distillation-free path to stronger judgment models and improved RAG performance, with publicly available code for reproducibility and extension.

Abstract

Retrieval-Augmented Generation (RAG) has proven effective in alleviating hallucinations in Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models have the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, which leads to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, uses judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge agree closely with those of a superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
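
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how judge-consistency is computed, so the sketch assumes a simple agreement rate between verdicts, and the helper names (`dimension_combinations`, `consistency_scores`, `select_preference_pair`) are hypothetical. Each judgment is reduced to a verdict label produced under one combination of judgment dimensions; the most consistent judgment is taken as accepted and the least consistent as rejected, forming one preference pair of the kind DPO training expects.

```python
from itertools import combinations

# Judgment dimensions named in the TL;DR.
DIMENSIONS = ["hallucination", "completeness", "coherence", "semantic consistency"]


def dimension_combinations(dimensions):
    """Return every non-empty combination of judgment dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def consistency_scores(verdicts):
    """Score each verdict by the fraction of the other verdicts that match it.

    Assumption: agreement rate stands in for judge-consistency here;
    the paper may use a different measure.
    """
    n = len(verdicts)
    if n < 2:
        raise ValueError("need at least two verdicts to measure consistency")
    return [
        sum(1 for j, other in enumerate(verdicts) if j != i and other == verdict) / (n - 1)
        for i, verdict in enumerate(verdicts)
    ]


def select_preference_pair(judgments):
    """Pick the most and least consistent judgments as a preference pair.

    `judgments` is a list of (dimension_combo, verdict) tuples, where the
    verdict is whatever label the judge produced under that prompt.
    Ties are broken by position (first occurrence wins).
    """
    scores = consistency_scores([verdict for _, verdict in judgments])
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return {"chosen": judgments[best], "rejected": judgments[worst]}


if __name__ == "__main__":
    # Hypothetical verdicts from three differently prompted judgments.
    sample = [
        (("hallucination",), "A"),
        (("completeness", "coherence"), "A"),
        (("semantic consistency",), "B"),
    ]
    print(len(dimension_combinations(DIMENSIONS)))
    print(select_preference_pair(sample))
```

In this sketch the pair contains only dimension combinations and verdict labels. A real pipeline would store the full judgment text together with the prompt that produced it.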

Paper Structure

This paper contains 19 sections, 7 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The Framework of ConsJudge. It enhances the judgment capabilities of LLMs and benefits the training process of RAG models.
  • Figure 2: The Framework of Our ConsJudge Method.
  • Figure 3: Judge Agreement Evaluation. We analyze the agreement among different judgment models and use GLM-4-plus to evaluate the judgment quality of different models. GLM and Metric denote the GLM-4-plus and Raw Metric models, respectively. Vanilla LLM and ConsJudge are implemented with Qwen2.5-14B.
  • Figure 4: Judgment Consistency of Vanilla LLMs and ConsJudge. We use both vanilla LLMs and ConsJudge to show the judgment consistency among all hybrid evaluation aspects used to train ConsJudge.
  • Figure 5: Distribution of Judgment Consistency Score of Both Vanilla LLMs and ConsJudge.
  • ...and 5 more figures