Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?

Jianfeng He, Runing Yang, Linlin Yu, Changbin Li, Ruoxi Jia, Feng Chen, Ming Jin, Chang-Tien Lu

TL;DR

This paper introduces a comprehensive benchmark for uncertainty estimation on text summarization (UE-TS) that incorporates 31 NLG metrics across four dimensions. Its findings emphasize that reliable and efficient evaluation of UE-TS techniques requires considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods.

Abstract

Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of evaluation methods for uncertainty estimation on text summarization (UE-TS). This concern stems from the dependence of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques. Our code and data are available at https://github.com/he159ok/Benchmark-of-Uncertainty-Estimation-Methods-in-Text-Summarization.

Paper Structure (37 sections, 3 equations, 41 figures, 11 tables)

Figures (41)

  • Figure 1: Diagram of the relationship between the Uncertainty Estimation (UE) metric, NLG metrics, and UE methods in the evaluation process. To evaluate a UE-TS method, the generated texts (or intermediate outputs, such as token probabilities) and, optionally, the input text (or ground-truth summary) are passed to an NLG metric and to a UE method, which yield an NLG metric score and an uncertainty score, respectively, for every test sample. The NLG metric scores and uncertainty scores of all test samples are then passed to a UE metric to obtain the uncertainty metric score of the UE method.
  • Figure 2: Diagram of the $PR_{\phi}$ calculation example with testing sample size $N=4$. In this example, the min-max normalized scores are $\hat{s}_{NLG}=[0, 0.56, 0.47, 1]$, which are not drawn in the figure. Once we have obtained the sample rank $a_{\phi}$ based on a score list from method $\phi$, we rerank $r_{NLG}$ via $a_{\phi}$ to get $r_{\phi}$. Then, we use Eq. \ref{eq:cum_risk} to cumulatively sum the elements and obtain $\widetilde{r}_{\phi}$. Finally, $PR_{\phi}$ is the mean of $\widetilde{r}_{\phi}$.
  • Figure 3: Diagram of Spearman correlation between NLG metrics on the AESLC dataset from the view of the uncertainty estimation methods used in Fig. \ref{fig:spear_ue_aes_bart}. The generated summaries are from BART. For the GPT-3.5-based NLG metrics, we only conduct wo-GPT-3.5 on the BART generation model setting.
  • Figure 4: Diagram of Spearman correlation between NLG metrics on the AESLC dataset from the view of the uncertainty estimation methods used in Fig. \ref{fig:spear_ue_aes_gpt35}. The generated summaries are from GPT-3.5. For the GPT-3.5-based NLG metrics, we only draw wi-ingt-GPT-3.5 results to save space.
  • Figure 5: Diagram of Spearman correlation between NLG metrics on the AESLC dataset from the view of the uncertainty estimation methods used in Fig. \ref{fig:spear_ue_aes_llama}. The generated summaries are from Llama 2. For the GPT-3.5-based NLG metrics, we only draw wi-ingt-GPT-3.5 results to save space.
  • ...and 36 more figures
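
The $PR_{\phi}$ steps described in the Figure 2 caption (min-max normalize, rank by method $\phi$, rerank, cumulatively sum, take the mean) can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the referenced cumulative-risk equation is not reproduced on this page, so the code assumes that the risk is $r_{NLG} = 1 - \hat{s}_{NLG}$, that samples are ordered from lowest to highest score under method $\phi$, and that the running sum is divided by $N$. The function names and the example uncertainty scores are hypothetical.

```python
from itertools import accumulate


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def pr_score(nlg_scores, method_scores):
    """Mean cumulative risk after ranking samples by `method_scores`.

    ASSUMPTIONS (not confirmed by the source page):
      * risk is 1 - normalized NLG score,
      * samples are ordered by ascending method score,
      * each running sum is divided by the sample count N.
    """
    n = len(nlg_scores)
    if n == 0 or n != len(method_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    s_hat = min_max_normalize(nlg_scores)
    r_nlg = [1.0 - s for s in s_hat]  # assumed risk definition

    # a_phi: sample indices ordered by the score list of method phi
    a_phi = sorted(range(n), key=lambda i: method_scores[i])
    # r_phi: risks reranked via a_phi
    r_phi = [r_nlg[i] for i in a_phi]
    # r_tilde_phi: cumulative sums of the reranked risks (assumed 1/N scaling)
    r_tilde_phi = [c / n for c in accumulate(r_phi)]
    return sum(r_tilde_phi) / n


if __name__ == "__main__":
    s_hat_nlg = [0, 0.56, 0.47, 1]            # values from the Figure 2 caption
    uncertainty = [0.9, 0.2, 0.4, 0.1]        # hypothetical UE scores
    oracle = [1.0 - s for s in s_hat_nlg]     # ranking by the risk itself
    print("PR (UE method):", pr_score(s_hat_nlg, uncertainty))
    print("PR (oracle):   ", pr_score(s_hat_nlg, oracle))
```

Under these assumptions, a method whose ranking places low-risk samples first accumulates risk more slowly, so a smaller value indicates a ranking closer to the oracle ordering.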