MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Seyoung Song, Seogyeong Jeong, Eunsu Kim, Jiho Jin, Dongkwan Kim, Jay Shin, Alice Oh

TL;DR

MuG-Eval is a language-agnostic framework that evaluates multilingual generation by recasting existing benchmarks as information-gap conversational tasks. It uses three tasks (Easy Twenty Questions, MCQ Conversation, and Code Reconstruction) and takes task completion rate as the measure of generation ability, which avoids both language-specific tools and LLMs-as-judges. Across 8 LLMs and 30 languages, MuG-Eval scores correlate strongly with established multilingual benchmarks (Pearson's $r$ and Spearman's $\rho$ both above 0.75) and reveal cross-language patterns, including the limited transferability of English-only substitutes for low-resource languages. The framework is scalable and resource-efficient, and the paper also analyzes task-specific discriminative power, substitution effects, and qualitative error patterns. It acknowledges two limitations: task success does not directly measure linguistic quality, and broader human validation is still needed.

Abstract

Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks ($r$ > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
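
The proxy metric described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each task instance yields a final model answer and a gold target, and that success is decided by string matching after a simple, language-agnostic normalization. The function names `normalize` and `task_success_rate` and the specific normalization steps are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Language-agnostic cleanup (an assumption, not taken from the paper):
    Unicode NFKC normalization, case folding, and whitespace stripping."""
    return unicodedata.normalize("NFKC", text).casefold().strip()


def task_success_rate(predictions: list[str], targets: list[str]) -> float:
    """Fraction of task instances whose final answer matches the target.

    Uses only exact string matching after normalization, so it needs no
    language-specific tooling and no LLM judge.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    preds = ["Apple ", "banana", "kiwi"]
    golds = ["apple", "banana", "mango"]
    print(f"success rate: {task_success_rate(preds, golds):.3f}")
```

Because the score depends only on whether the task was completed, the same function applies unchanged to any target language.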

Paper Structure

This paper contains 50 sections, 10 figures, 18 tables.

Figures (10)

  • Figure 1: General concept of MuG-Eval. Two instances of the same LLM engage in self-communication in the target language to complete information-gap tasks. Model outputs are evaluated using algorithmic methods (e.g., string matching or code testing), without requiring language-specific tools or LLMs-as-judges. Task success rate serves as a proxy for measuring the model's multilingual generation capability.
  • Figure 2: Overview of evaluation tasks. Two instances of the same LLM engage in self-communication in the target language to complete information-gap tasks: (1) Easy Twenty Questions: guessing a hidden word; (2) MCQ Conversation: finding the answer through passage-based dialogue; and (3) Code Reconstruction: explaining and reconstructing code.
  • Figure 3: Accuracy of 8 LLMs across three tasks in 30 languages. Languages are grouped by resource level and sorted by average performance within each group. Results show that Code Reconstruction is the easiest task, followed by MCQ Conversation and Easy Twenty Questions. The gap is minor between high and mid-resource languages, but substantial between mid and low. Larger models consistently outperform smaller ones within the same language family, and tasks exhibit distinct ceiling effects.
  • Figure 4: Score distributions across six evaluation tasks, demonstrating varying discriminative powers. Notably, MCQ Conversation, derived from the Belebele task, exhibits greater statistical dispersion, indicating greater ability to distinguish between models than the original Belebele benchmark.
  • Figure 5: Correlation analysis between MuG-Eval tasks and existing multilingual benchmarks. Heatmaps show Pearson's $r$ (left) and Spearman's $\rho$ (right) correlation coefficients between three MuG-Eval tasks and three established benchmarks. All correlations exceed 0.75, demonstrating strong consistency between MuG-Eval and existing evaluation methods, validating its effectiveness as a multilingual evaluation framework.
  • ...and 5 more figures
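
The correlation analysis summarized in the Figure 5 caption compares MuG-Eval task scores with scores on established benchmarks using Pearson's $r$ and Spearman's $\rho$. The sketch below shows how those two coefficients can be computed for one pair of score vectors. It is an illustration under stated assumptions rather than the authors' analysis code: the function names are hypothetical, ties are handled with average ranks, and the numbers in the usage section are made-up placeholders, not values from the paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson's r computed on the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Placeholder scores for illustration only; not taken from the paper.
    proxy_scores = [0.31, 0.48, 0.55, 0.72, 0.80]
    benchmark_scores = [0.40, 0.52, 0.50, 0.77, 0.85]
    print(f"Pearson r    = {pearson(proxy_scores, benchmark_scores):.3f}")
    print(f"Spearman rho = {spearman(proxy_scores, benchmark_scores):.3f}")
```

Reporting both coefficients is useful because Pearson's $r$ is sensitive to linear agreement in the raw scores, while Spearman's $\rho$ only checks whether the two evaluations rank the models or languages in the same order.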