MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators
John Mendonça, Alon Lavie, Isabel Trancoso
TL;DR
MEDAL introduces a scalable, multilingual framework for benchmarking LLMs as open-domain dialogue evaluators through a three-stage pipeline: native multilingual dialogue generation with a multi-agent setup (user, chatbot, judge), automated multidimensional evaluation by a strong LLM, and human-guided benchmark curation with annotations. The approach reveals cross-lingual differences in chatbot performance and evaluator capabilities, showing that reasoning LLM evaluators outperform non-reasoning ones but still fall short of human judges on nuanced dimensions like empathy and commonsense. A new six-language meta-evaluation benchmark is built from synthetically generated data and validated through human annotations, enabling direct assessment of LLMs as evaluators across languages. MEDAL demonstrates extensibility to additional languages and models, offering a practical resource for advancing multilingual dialogue evaluation while highlighting limitations related to data generation, model biases, and evaluation reliability.
Abstract
Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lack multilingual coverage, which limits their ability to capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate multilingual user-chatbot dialogues conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of chatbot performance, which uncovers noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new multilingual meta-evaluation benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we find that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
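To make the meta-evaluation step concrete, the following is a minimal sketch of how an LLM judge could be compared against human annotations. It is an illustration under stated assumptions, not the procedure used in the paper: the record fields (`lang`, `dimension`, `human_issue`, `judge_issue`), the helper `issue_recall`, the language codes, and all values are hypothetical, and recall over human-flagged issues is just one plausible way to measure whether a judge detects issues such as lack of empathy, commonsense, or relevance.

```python
from collections import defaultdict

# Hypothetical records: each pairs a human annotation with an LLM judge's
# verdict for one dialogue and one quality dimension. Values are made up
# for illustration and are not results from the benchmark.
records = [
    {"lang": "en", "dimension": "empathy",     "human_issue": True,  "judge_issue": False},
    {"lang": "en", "dimension": "empathy",     "human_issue": True,  "judge_issue": True},
    {"lang": "de", "dimension": "commonsense", "human_issue": True,  "judge_issue": True},
    {"lang": "de", "dimension": "relevance",   "human_issue": True,  "judge_issue": False},
    {"lang": "pt", "dimension": "relevance",   "human_issue": False, "judge_issue": False},
    {"lang": "pt", "dimension": "empathy",     "human_issue": True,  "judge_issue": False},
]


def issue_recall(records, key):
    """Fraction of human-flagged issues the judge also flagged, grouped by `key`.

    Records where the human annotator saw no issue are ignored, so groups
    without any human-flagged issue do not appear in the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if r["human_issue"]:
            totals[r[key]] += 1
            if r["judge_issue"]:
                hits[r[key]] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Group by quality dimension, then by language.
    print(issue_recall(records, "dimension"))
    print(issue_recall(records, "lang"))
```

Grouping by `dimension` or by `lang` gives a per-dimension or per-language view of where a judge misses issues that humans flagged. A fuller comparison would also account for false positives, for example with precision or an agreement statistic.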
