A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators

Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li

TL;DR

This work tackles automatic dialogue evaluation by analyzing 30 LLMs (28 open-source and 2 proprietary) across 12 meta-evaluation datasets and five quality dimensions at both the turn and dialogue levels. It introduces GPT-4-based augmentation to fill in missing annotations, uses Pearson correlations to benchmark LLM judgments against human and GPT-4 baselines, and evaluates robustness through a suite of adversarial perturbations. The findings show that instruction-tuned and larger proprietary LLMs (notably GPT-4 and Palm-2) align most strongly with human judgments, yet no model reaches near-perfect correlations ($>0.8$) on average. Dimension-wise and model-wise ensembles of open-source models can nevertheless match or approach proprietary performance in several settings. The paper further finds that models perform best on coherence, relevance, and overall quality while remaining vulnerable to certain perturbations, which underscores the need for robust, multi-faceted evaluation strategies when deploying LLM-based dialogue evaluators in practice.

Abstract

Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notably among them, large language models (LLMs), particularly the instruction-tuned variants like ChatGPT, are shown to be promising substitutes for human judges. Yet, existing works on utilizing LLMs for automatic dialogue evaluation are limited in their scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis.
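To make the meta-evaluation setup concrete, the following is a minimal sketch of how an evaluator's scores might be correlated with human ratings, and how a simple model-level ensemble could be formed. The score values are toy numbers invented for illustration, the helper names `pearson` and `ensemble_mean` are hypothetical, and plain score averaging is an assumption: the text above does not specify how the ensembles are actually built.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


def ensemble_mean(score_lists):
    """Average several evaluators' scores item by item.

    Assumption: simple unweighted averaging, used here only as an
    illustrative ensemble; it is not taken from the paper.
    """
    return [mean(column) for column in zip(*score_lists)]


# Toy data (invented): human ratings and two evaluators' scores
# for the same five responses.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [2.0, 1.0, 4.0, 3.0, 5.0]
model_b = [1.0, 3.0, 2.0, 5.0, 4.0]

r_a = pearson(human, model_a)
r_b = pearson(human, model_b)
r_ens = pearson(human, ensemble_mean([model_a, model_b]))

print(f"model_a: {r_a:.3f}  model_b: {r_b:.3f}  ensemble: {r_ens:.3f}")
```

On this toy data the averaged scores correlate more strongly with the human ratings than either evaluator alone, because the two evaluators' errors partly cancel. That is only an illustration of why ensembling can help, not a result from the study.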

Paper Structure

This paper contains 52 sections, 1 equation, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: The instruction template for prompting GPT-4 to annotate both dialogue-level (top) and turn-level (bottom) data. For our meta-evaluation of the proprietary models including ChatGPT and Palm-2 Bison, we also use this instruction template.
  • Figure 2: An example for prompting open-source LLMs to evaluate the contextual relevance of the input response.
  • Figure 3: Inter-dimensional correlations of the gold human ratings and different model scores on the FED-Dial dataset (Mehri and Eskenazi, 2020).
  • Figure 4: Inter-dimensional correlations of the gold human ratings and different model scores on the FED-Turn dataset (Mehri and Eskenazi, 2020).