Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan

TL;DR

This study examines how well automatic evaluation metrics reflect human judgements on sequence-to-sequence tasks in the GPT-4 era. The authors run a zero-shot, multi-dataset, multi-model evaluation across text summarisation, text simplification and grammatical error correction (GEC), combining automatic metrics with human reviewers and with GPT-4 acting as a reviewer. They find a strong misalignment between reference-based metrics and human opinion: human reviewers rate the best models' outputs above the gold references, which calls the quality of many popular benchmarks into question. GPT-4 ranks model outputs broadly in line with human reviewers, though alignment varies by task and is lower for GEC. The work highlights the need for better evaluation designs and prompts, and suggests broader, prompt-engineering-focused future studies to better capture real-world model capabilities.

Abstract

The evaluation of Large Language Models (LLMs) is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation of a range of open-source and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs in a way which aligns reasonably closely with human judgement despite task-specific variations, with a lower alignment in the GEC task.
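
As an illustration of what "alignment" between two rankings can mean, the sketch below computes Spearman's rank correlation between a human ranking and a GPT-4 ranking of the same models. This is an assumption made for illustration only: the abstract does not state which agreement measure the authors use, and the model names and rankings here are hypothetical placeholders, not results from the paper.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two tie-free rankings.

    Each ranking is a list of the same items ordered from best to worst.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two ranked items")
    if len(ranking_b) != n or len(set(ranking_a)) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same distinct items")

    # Position of every item in the second ranking.
    position_b = {item: index for index, item in enumerate(ranking_b)}

    # Sum of squared rank differences across items.
    squared_diff = sum(
        (index_a - position_b[item]) ** 2
        for index_a, item in enumerate(ranking_a)
    )
    return 1 - 6 * squared_diff / (n * (n * n - 1))


# Hypothetical rankings (placeholders, not data from the paper).
human_ranking = ["model_a", "model_b", "model_c", "model_d"]
gpt4_ranking = ["model_a", "model_c", "model_b", "model_d"]

rho = spearman_rho(human_ranking, gpt4_ranking)
print(f"Spearman rho between the two rankings: {rho:.2f}")
```

A value near 1 would indicate that the two reviewers order the models almost identically, while a value near 0 would indicate little relationship between the orderings. Rankings with ties would need a tie-aware formulation, which this sketch deliberately leaves out.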

Paper Structure (20 sections, 7 tables)