A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators

Chen Zhang; Luis Fernando D'Haro; Yiming Chen; Malu Zhang; Haizhou Li

TL;DR

This work studies automatic dialogue evaluation by analyzing 30 LLMs (28 open-source and 2 proprietary) across 12 meta-evaluation datasets and five quality dimensions at both the turn and dialogue levels. It uses GPT-4 to fill in missing annotations, uses Pearson correlations to benchmark LLM judgments against human and GPT-4 baselines, and evaluates robustness through a suite of adversarial perturbations. The findings show that instruction-tuned and larger proprietary LLMs (notably GPT-4 and Palm-2) align most closely with human judgments, yet no model reaches near-perfect correlation ($>0.8$) on average. Dimension-wise and model-wise ensembles of open-source models can nonetheless match or approach proprietary performance in several settings. The paper further finds that models perform best on coherence, relevance, and overall quality while remaining vulnerable to certain perturbations, which underscores the need for robust, multi-faceted evaluation strategies when deploying LLM-based dialogue evaluators in practice.
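As a rough illustration of the meta-evaluation step described above, the sketch below computes a Pearson correlation between hypothetical human ratings and hypothetical evaluator scores for the same responses. The function name `pearson` and the toy score lists are assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each list.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one human rating and one evaluator score per response.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
evaluator_scores = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson(human_ratings, evaluator_scores)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the evaluator ranks and spaces responses much as the human annotators do. The TL;DR's threshold of 0.8 refers to averages of such correlations across datasets and dimensions.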

Abstract

Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notable among them, large language models (LLMs), particularly instruction-tuned variants like ChatGPT, have been shown to be promising substitutes for human judges. Yet existing work on using LLMs for automatic dialogue evaluation is limited in scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both the turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both the turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis.
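The ensembles mentioned in the abstract can be pictured with a minimal sketch that assumes simple unweighted averaging of aligned score lists. The paper's actual ensembling procedure may differ, and the helper `ensemble_mean` and the toy scores below are hypothetical.

```python
def ensemble_mean(score_lists):
    """Average several aligned score lists element-wise.

    Each inner list holds one score per evaluated item. The lists may come
    from different models (a model-level ensemble) or from different quality
    dimensions scored by one model (a dimension-level ensemble).
    """
    if not score_lists:
        raise ValueError("need at least one score list")
    length = len(score_lists[0])
    if any(len(scores) != length for scores in score_lists):
        raise ValueError("all score lists must have the same length")
    # zip(*score_lists) groups the scores that belong to the same item.
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


# Hypothetical toy scores from two evaluators for the same three responses.
model_a_scores = [1.0, 3.0, 5.0]
model_b_scores = [3.0, 3.0, 1.0]

combined = ensemble_mean([model_a_scores, model_b_scores])
print(combined)
```

The combined scores can then be correlated with human ratings in the same way as the scores of any single evaluator.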

Paper Structure (52 sections, 1 equation, 4 figures, 11 tables)

Introduction
Preliminaries
Meta-Evaluation
Datasets
Fill Up Missing Annotations With GPT-4
Meta-Evaluation Metrics
Large Language Models
Dialogue Evaluation with LLMs
Multi-Dimensional Correlation Analysis
Proprietary vs Open-Source Models
Instruction-Tuned vs Vanilla Models
LLaMA vs Other Open-Source Families
Impact of Instruction Data
Performance Across Dimensions
Performance of GPT-4
...and 37 more sections

Figures (4)

Figure 1: The instruction template for prompting GPT-4 to annotate both dialogue-level (top) and turn-level (bottom) data. For our meta-evaluation of the proprietary models including ChatGPT and Palm-2 Bison, we also use this instruction template.
Figure 2: An example for prompting open-source LLMs to evaluate the contextual relevance of the input response.
Figure 3: Inter-dimensional correlations of the gold human ratings and different model scores on the FED-Dial dataset (Mehri and Eskenazi, 2020).
Figure 4: Inter-dimensional correlations of the gold human ratings and different model scores on the FED-Turn dataset (Mehri and Eskenazi, 2020).
