Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

Aliyah R. Hsu, James Zhu, Zhichao Wang, Bin Bi, Shubham Mehrotra, Shiva K. Pentyala, Katherine Tan, Xiang-Bo Mao, Roshanak Omrani, Sougata Chaudhuri, Regunathan Radhakrishnan, Sitaram Asur, Claire Na Cheng, Bin Yu

TL;DR

The paper proposes Rate, Explain and Cite (REC), a family of fine-tuned general-purpose LLM auto-evaluators (REC-8B, REC-12B, REC-70B) that deliver ratings, explanations, and verifiable citations for generated content across faithfulness, instruction-following, coherence, and completeness. It introduces REC-Data, a large synthetic dataset for content-quality and RAG citations, and supports multiple citation modes to balance latency and granularity. Across extensive benchmarks (ALCE, ExpertQA, ABCD, RewardBench, LLM-AggreFact, CoBBLEr), REC-70B achieves state-of-the-art performance in content evaluation, with improved explanation quality and citation reliability. The work provides a public release of models and data, discusses training via LoRA, and addresses practical considerations such as latency, multilingual capability, and ethical implications of automated evaluation.

Abstract

LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-purpose LLM auto-evaluators, REC-8B, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanations and verifiable citations, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better-quality explanations and citations with minimal bias. Our REC dataset and models are available at https://github.com/adelaidehsu/REC.

Paper Structure

This paper contains 39 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Illustration of the REC framework for content quality evaluation. The auto-evaluator takes in context (a task prompt from another LLM), a generation from another LLM, and a user-specified evaluation metric (here, completeness). The auto-evaluator outputs a rating and an explanation with citations according to a user-specified citation mode (here, inline with context snippet). Citations are extracted verbatim from the context and shown underlined. For details of the alternative citation mode and the evaluation prompt, see the appendix on content quality citation.
  • Figure 2: Illustration of using REC models for general RAG citations.
  • Figure 3: Detailed breakdown of REC-Data distribution.
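
The Figure 1 caption notes that citations are extracted verbatim from the context, which is what makes them verifiable. A minimal sketch of such a check follows. It is not the REC implementation: the `Evaluation` container, its field names, the integer rating, and the `unverifiable_citations` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    # Hypothetical container for an auto-evaluator's output;
    # the field names and the integer rating are assumptions, not the REC schema.
    rating: int
    explanation: str
    citations: list[str] = field(default_factory=list)


def unverifiable_citations(context: str, evaluation: Evaluation) -> list[str]:
    """Return the cited snippets that do not appear verbatim in the context."""
    return [snippet for snippet in evaluation.citations if snippet not in context]


# Illustrative data only, not taken from the paper.
context = "The refund must be issued within 30 days. Shipping is free."
evaluation = Evaluation(
    rating=4,
    explanation="The response covers the refund window.",
    citations=["issued within 30 days", "Shipping costs $5"],
)
missing = unverifiable_citations(context, evaluation)
```

An empty result means every cited snippet can be located in the context by exact substring match. Any snippet returned was not copied verbatim and would need manual review.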