DEE: Dual-stage Explainable Evaluation Method for Text Generation

Shenyu Zhang, Yu Li, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi

TL;DR

Experimental results affirm DEE's superiority over existing evaluation methods, with significant improvements in both human correlation and efficiency.

Abstract

Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to lack explainability, issuing a single numerical score as the assessment outcome. Recent work has sought to mitigate this limitation by using large language models (LLMs) to offer more detailed error analyses, yet its applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions: the first stage efficiently identifies errors in generated texts, and the second stage provides comprehensive diagnostic reports. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset covers newly emerged issues such as hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm DEE's superiority over existing evaluation methods, with significant improvements in both human correlation and efficiency.
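
The dual-stage flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the instruction strings, the `evaluate` function, the comma-separated Stage I output format, and the choice to skip Stage II when Stage I reports no errors are all hypothetical, and `stub_model` is a stand-in for the fine-tuned Llama 2 model, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stage-specific instructions; the paper's actual prompts are not shown here.
STAGE1_INSTRUCTION = "List the error types in the output, or answer 'none'."
STAGE2_INSTRUCTION = "Write a diagnostic report explaining each detected error."


@dataclass
class Evaluation:
    error_types: List[str]
    report: Optional[str]  # None when Stage I detects no errors


def evaluate(source: str, generated: str, model: Callable[[str], str]) -> Evaluation:
    """Stage I: fast error detection. Stage II: diagnostic report.

    Assumption: Stage II is only invoked when Stage I finds at least one error.
    """
    stage1_prompt = f"{STAGE1_INSTRUCTION}\nSource: {source}\nOutput: {generated}"
    raw = model(stage1_prompt)
    # Assumed Stage I output format: comma-separated error types, or "none".
    error_types = [
        item.strip()
        for item in raw.split(",")
        if item.strip() and item.strip().lower() != "none"
    ]
    if not error_types:
        return Evaluation(error_types=[], report=None)

    stage2_prompt = (
        f"{STAGE2_INSTRUCTION}\nSource: {source}\nOutput: {generated}\n"
        f"Detected errors: {', '.join(error_types)}"
    )
    return Evaluation(error_types=error_types, report=model(stage2_prompt).strip())


def stub_model(prompt: str) -> str:
    """Stand-in for the evaluator model, with hard-coded answers for the demo."""
    if prompt.startswith(STAGE1_INSTRUCTION):
        return "hallucination" if "Mars" in prompt else "none"
    return "The output states a location that the source does not support."


clean = evaluate("The shop opens at 9.", "The shop opens at 9.", stub_model)
faulty = evaluate("The shop opens at 9.", "The shop is on Mars.", stub_model)
```

In this sketch, separating the two calls means the cheaper detection step runs on every example, while the longer report is generated only for flagged outputs.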

Paper Structure (23 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: DEE is fine-tuned on AntEval and applies a dual-stage strategy: fast error detection in Stage I, followed by a diagnostic report in Stage II.
  • Figure 2: Left: The distribution of error categories in AntEval. The inner circle depicts the principal categories and the outer circle depicts the corresponding sub-error categories. Right: Error distribution in each task, calculated as $N_\text{error} / N_\text{example}$. Here we count the number of sub-errors, each of which may occur more than once in one example.
  • Figure 3: The annotation interface for human experts to score generated texts.
  • Figure 4: Comparison with PLM-based methods. Inference time per example (ms), Kendall's Tau $\tau$ (%), and Pearson's correlation coefficient $\rho$ (%) are reported.
  • Figure 5: Left four: EC (%) and VR (%) across different tasks. Right: EC' (%) across different error categories.
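
The quantities named in the captions of Figures 2 and 4 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the task names and scores below are made up, the function names are hypothetical, and the Kendall's Tau variant shown is tau-a (no tie correction), which may differ from the variant used in the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def error_rate_per_task(examples):
    """Return N_error / N_example per task.

    `examples` is an iterable of (task, [sub_error, ...]) pairs. A sub-error
    that occurs several times in one example is counted each time.
    """
    n_error, n_example = Counter(), Counter()
    for task, sub_errors in examples:
        n_example[task] += 1
        n_error[task] += len(sub_errors)
    return {task: n_error[task] / n_example[task] for task in n_example}


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up data: task names and scores are placeholders, not AntEval content.
examples = [
    ("task_a", ["hallucination", "hallucination"]),
    ("task_a", []),
    ("task_b", ["toxicity"]),
    ("task_b", []),
]
rates = error_rate_per_task(examples)

human_scores = [1, 2, 3, 4]
metric_scores = [1, 3, 2, 4]
rho = pearson(human_scores, metric_scores)
tau = kendall_tau_a(human_scores, metric_scores)
```

The captions report $\tau$ and $\rho$ as percentages, so the values returned here would be multiplied by 100 for comparison.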