Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response

Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

TL;DR

This work interrogates the reliability of LLM-based reference-free evaluators for generated dialogue, arguing that purely reference-free scores struggle in closed-ended and knowledge-dependent settings. To probe this, the authors construct two adversarial meta-evaluation datasets, KdConv-ADV and DSTC7-ADV, combining open- and closed-ended cases with adversarial and knowledge-grounded examples. Through extensive experiments, they show that traditional reference-based metrics can outperform reference-free evaluators in low-diversity contexts, while reference-free evaluators exhibit limited discrimination and rely on external knowledge for reliability. The findings highlight the need for knowledge grounding and hybrid evaluation strategies to improve the robustness of dialogue quality assessment in real-world scenarios.

Abstract

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, using them poses many challenges. Reference-free evaluators are better suited to open-ended examples, where responses with different semantics can all be acceptable. But not all examples are open-ended. For closed-ended examples with a unique semantically correct response, reference-free evaluators will still rate a response as high quality even when it is inconsistent with the facts and with the semantics of the reference. In order to comprehensively evaluate the reliability of LLM-based evaluators, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to reasonably evaluate closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

Paper Structure (25 sections, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Evaluation examples of ChatGPT. The semantically correct response for this example is unique. The reference response is "I checked and his hometown should be Düsseldorf, Germany."
  • Figure 2: The distribution of reference-based metrics for different types of examples on KdConv-ADV (left) and DSTC7-ADV (right). The "over" label corresponds to performance on the overall dataset.
  • Figure 3: The score distribution of ChatGPT on DSTC7-ADV (top) and KdConv-ADV (bottom).
  • Figure 4: Prompt Template. The {aspect} denotes the evaluation dimension, such as fluency. The explanation of {aspect} includes corresponding level definitions. The {SEP} represents the delimiter.
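
The placeholder structure described for Figure 4 can be illustrated with a minimal sketch. This is not the paper's prompt: the template wording, the 1-to-5 scale, the level definitions for fluency, the `###` delimiter, and the names `TEMPLATE`, `ASPECT_EXPLANATIONS`, and `build_prompt` are all assumptions made for illustration. Only the `{aspect}` and `{SEP}` placeholders and the idea of an aspect explanation with level definitions come from the caption.

```python
# Hypothetical sketch of a prompt template with {aspect} and {SEP} placeholders.
# The wording below is an assumption, not the template used in the paper.

SEP = "\n###\n"  # assumed delimiter standing in for {SEP}

# Assumed level definitions for one evaluation aspect (illustrative only).
ASPECT_EXPLANATIONS = {
    "fluency": (
        "1 = unreadable, 2 = many grammatical errors, 3 = understandable, "
        "4 = minor errors, 5 = fully fluent"
    ),
}

TEMPLATE = (
    "Rate the {aspect} of the response on a scale from 1 to 5.{SEP}"
    "Definition of {aspect}: {explanation}{SEP}"
    "Dialogue context: {context}{SEP}"
    "Response: {response}{SEP}"
    "Score:"
)


def build_prompt(aspect: str, context: str, response: str) -> str:
    """Fill the template for one evaluation aspect."""
    return TEMPLATE.format(
        aspect=aspect,
        explanation=ASPECT_EXPLANATIONS[aspect],
        context=context,
        response=response,
        SEP=SEP,
    )


if __name__ == "__main__":
    print(
        build_prompt(
            "fluency",
            "Do you know where his hometown is?",
            "I checked and his hometown should be Düsseldorf, Germany.",
        )
    )
```

Keeping the aspect explanations in a dictionary means further evaluation dimensions can be added without touching the template string. Note that a prompt of this shape contains no reference response or external knowledge, which is the reference-free setting the abstract identifies as risky for closed-ended examples.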