On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
John Mendonça, Alon Lavie, Isabel Trancoso
TL;DR
The paper argues that current open-domain dialogue evaluation benchmarks are misaligned with the capabilities of state-of-the-art chatbots, as they rely on outdated generative models and focus on Fluency and Relevance. It surveys benchmark datasets and LLM-based evaluators, then conducts a qualitative study using the SODA dataset with expert annotations to assess how well modern evaluators detect issues across Fluency, Coherence, and Commonsense. The findings show that large LLMs, particularly GPT-4, offer the strongest overall correlations but still struggle to detect coherence and commonsense issues, highlighting the need for new benchmarks that reflect contemporary response generation and multilingual contexts. The work advocates prioritizing Coherence and Commonsense, expanding multilingual coverage, and developing flexible benchmarks that stay relevant as chatbots evolve, to support more reliable and comparable evaluations of open-domain dialogue systems.
Abstract
Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks and, together with human evaluation, form the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.
