mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

TL;DR

mHumanEval introduces a massively multilingual benchmark for code generation, expanding the scope beyond English-to-Python to 204 natural languages and 25 programming languages, with expert translations for 15 languages. The authors build a rigorous translation-and-quality-assessment pipeline using GPT-4o, NLLB, and Google Translate, guided by BERTScore and CometKiwi metrics, and categorize languages by resource class to analyze performance across NL resource levels. Evaluations of six diverse LLMs reveal that GPT-4o and Claude-3.5 provide the most robust cross-language code generation, while non-English prompts, especially in low-resource languages, show notable declines for several models; results also emphasize the importance of multilingual pretraining and data diversity. The work offers extensive subsets and an expert-translated variant to support rapid prototyping and future benchmarking, highlighting practical implications for deploying multilingual code-generation systems and guiding future research toward broader language coverage and transfer learning strategies.

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.

Paper Structure

This paper contains 58 sections, 7 equations, 21 figures, 27 tables, 1 algorithm.

Figures (21)

  • Figure 1: Code snippet generated by GPT3.5 when prompted in the Nyanja language to write Python code that detects leap years. Some Python keywords are transformed into Nyanja words, resulting in compilation issues.
  • Figure 2: The workflow to generate prompts in a target language from the original HumanEval. Original prompts are first extracted manually. Then 3 machine translation models (GPT4o, NLLB, Google Translate) generate 13 candidates as well as round-trip translations. Next, we evaluate each candidate's quality with BERTScore, computed on the round-trip translations, and with CometKiwi as a reference-free metric (if the language is supported). We then select the best candidate for each original prompt and compile the new benchmark for the target language.
  • Figure 3: A sample prompt instance from the original HumanEval benchmark.
  • Figure 4: Quality evaluation of the translated prompts chosen for mHumanEval. Our method results in better quality prompts compared to the model-specific translations (as depicted in the Appendix).
  • Figure 5: Curating mHumanEval-Expert via native human translation followed by expert programmer evaluation.
  • ...and 16 more figures
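
The candidate-selection step described in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes the metric values have already been computed elsewhere and are passed in as plain numbers. The `Candidate` class, the `combined_score` and `select_best` helpers, the averaging rule, and all scores and placeholder texts below are hypothetical and chosen only for illustration; the caption does not state how the two metrics are combined, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One machine-translated candidate for a single original prompt (hypothetical type)."""
    system: str                         # e.g. "GPT4o", "NLLB", "Google Translate"
    text: str                           # translated prompt (placeholder strings below)
    bertscore_f1: float                 # BERTScore of the round-trip translation vs. the original
    cometkiwi: Optional[float] = None   # reference-free score; None if the language is unsupported


def combined_score(c: Candidate) -> float:
    """Average whichever metrics are available (an assumed combination rule)."""
    scores = [c.bertscore_f1]
    if c.cometkiwi is not None:
        scores.append(c.cometkiwi)
    return sum(scores) / len(scores)


def select_best(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=combined_score)


# Illustrative, made-up scores for three candidates of one prompt.
candidates = [
    Candidate("GPT4o", "candidate A", bertscore_f1=0.92, cometkiwi=0.80),
    Candidate("NLLB", "candidate B", bertscore_f1=0.95, cometkiwi=0.70),
    Candidate("Google Translate", "candidate C", bertscore_f1=0.90, cometkiwi=0.84),
]
best = select_best(candidates)
print(best.system)
```

For a language that CometKiwi does not support, `cometkiwi` stays `None` and the sketch falls back to the round-trip BERTScore alone, mirroring the "if the language is supported" condition in the caption.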