RepEval: Effective Text Evaluation with LLM Representation

Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou

TL;DR

RepEval is a metric that uses projections of LLM representations for text evaluation. It exhibits a higher correlation with human judgments than previous methods, even in complex evaluation scenarios involving pair-wise selection under nuanced aspects.

Abstract

The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text evaluation are often tailored to specific scenarios, while LLM-based evaluation metrics are costly: they either require fine-tuning or rely heavily on the generation capabilities of LLMs. Moreover, previous LLM-based metrics ignore the fact that, within the space of LLM representations, there exist direction vectors that indicate the estimation of text quality. To this end, we introduce RepEval, a metric that leverages the projection of LLM representations for evaluation. Through simple prompt modifications, RepEval can easily transition to various tasks, requiring only minimal sample pairs for direction vector construction. Results on fourteen datasets across two evaluation tasks demonstrate the high effectiveness of our method, which exhibits a higher correlation with human judgments than previous methods, even in complex evaluation scenarios involving pair-wise selection under nuanced aspects. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
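
The core idea in the abstract, projecting a representation onto a quality direction built from a few sample pairs, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The toy 3-dimensional vectors stand in for LLM hidden states, and all function names are hypothetical. Taking the leading direction of the uncentered difference vectors by power iteration is a simplifying assumption standing in for the PCA step mentioned in the Figure 5 caption.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_direction(diffs, iters=200, seed=0):
    """Leading eigenvector of sum_i x_i x_i^T, found by power iteration.

    Assumption: the difference vectors are not centered first.
    """
    rng = random.Random(seed)
    dim = len(diffs[0])
    d = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(iters):
        # (sum_i x_i x_i^T) d equals sum_i (x_i . d) x_i
        nxt = [0.0] * dim
        for x in diffs:
            c = dot(x, d)
            for j in range(dim):
                nxt[j] += c * x[j]
        d = normalize(nxt)
    return d


def build_direction(good_reps, bad_reps):
    """Direction vector from paired (better, worse) representations."""
    diffs = [[g - b for g, b in zip(gr, br)]
             for gr, br in zip(good_reps, bad_reps)]
    d = top_direction(diffs)
    # Orient d so that the better sample of each pair projects higher on average.
    if sum(dot(x, d) for x in diffs) < 0:
        d = [-x for x in d]
    return d


def score(rep, d):
    """Absolute evaluation: the projection of a representation onto d."""
    return dot(rep, d)


def prefer_first(rep_a, rep_b, d):
    """Pair-wise evaluation: True if rep_a projects higher than rep_b."""
    return score(rep_a, d) > score(rep_b, d)


# Toy stand-ins for hidden states of three (better, worse) text pairs.
good_reps = [[2.0, 0.1, 0.0], [1.5, -0.2, 0.1], [2.5, 0.0, -0.1]]
bad_reps = [[0.0, 0.0, 0.1], [-0.5, -0.1, 0.0], [0.5, 0.1, 0.0]]

d = build_direction(good_reps, bad_reps)
print(score([1.8, 0.0, 0.0], d), score([0.2, 0.0, 0.0], d))
```

In this sketch the same direction vector serves both tasks: `score` gives a scalar for absolute evaluation, and `prefer_first` compares two scores for pair-wise selection. In practice the vectors would be hidden states collected from an LLM at a chosen layer and token position, which this toy example does not attempt to reproduce.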

Paper Structure

This paper contains 57 sections, 7 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Pipeline for collecting representations with a decoder-only LLM and constructing the projection direction.
  • Figure 2: Evaluation process of absolute evaluation and pair-wise evaluation.
  • Figure 3: Correlation results for the absolute evaluation of fluency using RepEval with different token and position selections. Layer and token counts are in reverse order, measuring the distance from the output. For instance, layer=-1 represents the last layer closest to the output.
  • Figure 4: t-SNE visualization of the $rep$s after dimensionality reduction. The triangle and X markers in each figure represent the $rep$s of the same sample obtained using different prompts.
  • Figure 5: Random test results. Box plots represent meta-evaluation results corresponding to random vectors $v$, while the scatter points represent the results corresponding to the direction vector $d$ obtained through PCA. For pair-wise evaluation, the y-axis starts at 0.5, which is the expected accuracy of random guessing.