OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models

Yang Liu; Meng Xu; Shuo Wang; Liner Yang; Haoyu Wang; Zhenghao Liu; Cunliang Kong; Yun Chen; Yang Liu; Maosong Sun; Erhong Yang

TL;DR

OMGEval introduces the first open-source multilingual generative evaluation benchmark for LLMs, providing 804 open-ended questions per language across Zh, Ru, Fr, Es, and Ar, with localization, human verification, and GPT-4 adjudication. The data collection pipeline combines preliminary GPT-4 translation, manual localization of culturally specific elements, and manual verification; evaluation uses win-rate scoring intended to reflect real-world usage. Experimental results show that GPT-4 substantially outperforms open-source multilingual models (with Guanaco-13b the strongest open-source model), while the localization subsets reveal language-specific challenges and the need for broader cultural coverage. Overall, OMGEval offers a practical, culturally aware benchmark for diagnosing multilingual capabilities and guiding improvements in non-English LLM performance and fairness.

Abstract

Modern large language models (LLMs) should generally benefit individuals from various cultural backgrounds around the world. However, most recent advanced generative evaluation benchmarks tailored for LLMs mainly focus on English. To this end, we introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important LLM capabilities, such as general knowledge and logical reasoning. Each question is rigorously verified by human annotators. Notably, to sufficiently reflect the compatibility of LLMs with different cultural backgrounds, we perform localization for each non-English language. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar). Following AlpacaEval, we employ GPT-4 as the adjudicator to automatically score different model outputs, an approach that has been shown to correlate closely with human evaluation. We evaluate several representative multilingual LLMs on the proposed OMGEval, which we believe will provide a valuable reference for the community to further understand and improve the multilingual capability of LLMs. OMGEval is available at https://github.com/blcuicall/OMGEval.
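To make the GPT-4-as-adjudicator scoring concrete, the following is a minimal sketch of how a per-language win rate could be aggregated from pairwise verdicts. It is an illustration under stated assumptions, not the authors' implementation: the function name `win_rates`, the verdict labels, and the choice to count a tie as half a win are all hypothetical, and the sample verdicts are made up for the example.

```python
from collections import defaultdict

# Hypothetical scoring scheme: a tie counts as half a win (an assumption,
# not something stated in the abstract).
POINTS = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rates(judgments):
    """Aggregate pairwise verdicts into a win rate (in percent) per language.

    `judgments` is an iterable of (language, verdict) pairs, where verdict is
    "win", "tie", or "loss" for the evaluated model against a reference model.
    """
    totals = defaultdict(int)
    scores = defaultdict(float)
    for language, verdict in judgments:
        totals[language] += 1
        scores[language] += POINTS[verdict]
    return {lang: 100.0 * scores[lang] / totals[lang] for lang in totals}


# Made-up verdicts, for illustration only.
sample = [
    ("Zh", "win"), ("Zh", "loss"), ("Zh", "tie"), ("Zh", "win"),
    ("Ar", "loss"), ("Ar", "win"),
]
rates = win_rates(sample)
print(rates)
```

In a real run, each language would contribute 804 verdicts, one per question, and the verdicts would come from the adjudicator model rather than a hand-written list.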

Paper Structure (30 sections, 7 figures, 5 tables)

Introduction
Background
Importance of Multilingual Evaluation
Applicability and Generalizability
Combating Cultural Hegemony
Real-World Scenario Simulation
Necessity of Generative Evaluation
Complexity and Range of Outputs
Data Collection
Preliminary Translation
Manual Localization
Manual Verification
Evaluation
Data Analysis
Capability Type
...and 15 more sections

Figures (7)

Figure 1: Example of question localization; the language-specific items are highlighted in different colors. In different cultural contexts, discussions about the same topic can vary significantly. For instance, when talking about festivals and food, Americans might focus on Thanksgiving and turkey, while Chinese people may discuss the Dragon Boat Festival and Zongzi.
Figure 2: A question that requires LLMs to complete a proverb given its prefix. Proverbs differ widely across languages and may be difficult to understand without knowledge of the corresponding language.
Figure 3: Construction process of OMGEval.
Figure 4: Illustration of the evaluation process.
Figure 5: Distribution of questions in OMGEval: General Knowledge (27.0%), Professional Knowledge (26.7%), Generation and creation (17.8%), Language Comprehension (9.0%), Code Skills (6.1%), Logical Reasoning (5.5%), Maths Competence (4.0%), Chit chat (2.9%), Harmlessness (1.1%).
...and 2 more figures
