MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

John Mendonça, Alon Lavie, Isabel Trancoso

TL;DR

MEDAL introduces a scalable, multilingual framework for benchmarking LLMs as open-domain dialogue evaluators through a three-stage pipeline: native multilingual dialogue generation with a multi-agent setup (user, chatbot, judge), automated multidimensional evaluation by a strong LLM, and human-guided benchmark curation with annotations. The approach reveals cross-lingual differences in chatbot performance and evaluator capabilities, showing that reasoning LLM evaluators outperform non-reasoning ones but still fall short of human judges on nuanced dimensions like empathy and commonsense. A new six-language meta-evaluation benchmark is built from synthetically generated data and validated through human annotations, enabling direct assessment of LLMs as evaluators across languages. MEDAL demonstrates extensibility to additional languages and models, offering a practical resource for advancing multilingual dialogue evaluation while highlighting limitations related to data generation, model biases, and evaluation reliability.

Abstract

Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
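The generation step described above (an LLM user whose utterances are validated by an LLM judge in a feedback loop, conversing with a target chatbot) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_dialogue`, the callable signatures, the retry limit, and the toy agents are all hypothetical stand-ins for real LLM calls.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def generate_dialogue(
    seed_context: str,
    user_agent: Callable[[str, List[Turn], str], str],
    chatbot: Callable[[List[Turn]], str],
    judge: Callable[[List[Turn], str], Tuple[bool, str]],
    num_turns: int = 3,
    max_retries: int = 2,
) -> List[Turn]:
    """Hypothetical user/judge/chatbot loop; all callables stand in for LLMs."""
    history: List[Turn] = []
    for _ in range(num_turns):
        # The user agent drafts an utterance from the seed context and history.
        utterance = user_agent(seed_context, history, "")
        for _ in range(max_retries):
            accepted, feedback = judge(history, utterance)
            if accepted:
                break
            # Rejected: regenerate using the judge's feedback.
            utterance = user_agent(seed_context, history, feedback)
        # Assumption: after the last retry the utterance is kept unvalidated.
        history.append(("user", utterance))
        history.append(("chatbot", chatbot(history)))
    return history


# Toy agents (assumptions for illustration only).
def toy_user(seed: str, history: List[Turn], feedback: str) -> str:
    base = f"Tell me more about {seed}"
    return base + "?" if feedback else base


def toy_judge(history: List[Turn], utterance: str) -> Tuple[bool, str]:
    ok = utterance.endswith("?")
    return ok, ("" if ok else "End the utterance with a question mark.")


def toy_chatbot(history: List[Turn]) -> str:
    return f"Reply to: {history[-1][1]}"


dialogue = generate_dialogue("hiking", toy_user, toy_chatbot, toy_judge, num_turns=2)
```

In this sketch the judge only gates the user side of the conversation, which matches the description of user utterances being validated within a feedback loop; the chatbot's responses are left untouched so they can be evaluated later.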

Paper Structure

This paper contains 74 sections, 18 figures, 9 tables.

Figures (18)

  • Figure 1: Overview of MEDAL, a 3-step multi-agent framework for curating native multilingual dialogue meta-evaluation benchmarks. ➊ Dialogue generation (§\ref{sec:gen}): Generation of multi-turn, multilingual synthetic dialogues seeded from diverse contexts, between an LLM acting as a user (with utterances validated by an LLM-judge within a feedback loop) and the target chatbot; ➋ Automated labelling (§\ref{sec:eval}): Multidimensional evaluation of all dialogues by a strong LLM (GPT-4.1); ➌ Benchmark curation (§\ref{sec:bench}): Curation of meta-evaluation benchmarks with human annotations, which are used to assess LLM judges.
  • Figure 2: Heatmap of detected issues and average overall quality determined by GPT-4.1 for all tested chatbots across six languages.
  • Figure 3: Partial English example from our curated benchmark, together with a failed evaluation conducted by Gemini-2.5-flash (with reasoning). Additional examples can be found in Appendix \ref{sec:app_bench}.
  • Figure 4: Spearman correlations across languages (all significant with $p<0.01$). Combined denotes the benchmark comprising all individual languages.
  • Figure 5: F1 scores for the positive label (F1+, detected issue) and negative label (F1-, no issue). Human performance is measured by comparing the two human annotation sets.
  • ...and 13 more figures
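
The per-label F1 scores named in the Figure 5 caption (F1+ for "issue detected", F1- for "no issue") can be computed as in the sketch below. The binary encoding (1 = issue, 0 = no issue), the helper name `f1_for_label`, and the toy label lists are assumptions for illustration; the paper's exact label scheme and aggregation are not given in this section.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[int], pred: Sequence[int], label: int) -> float:
    """F1 score treating `label` as the target class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when the label never occurs.
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): 1 = issue detected, 0 = no issue.
human = [1, 1, 0, 0, 1, 0]
judge = [1, 0, 0, 0, 1, 1]

f1_pos = f1_for_label(human, judge, label=1)  # F1+
f1_neg = f1_for_label(human, judge, label=0)  # F1-
```

Comparing one human annotation set against the other with the same function would give a human reference point of the kind the caption describes.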