mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Nishat Raihan; Antonios Anastasopoulos; Marcos Zampieri

TL;DR

mHumanEval is a massively multilingual benchmark for code generation that expands the scope beyond English-to-Python to 204 natural languages and 25 programming languages, with expert translations for 15 languages. The authors build a translation-and-quality-assessment pipeline using GPT-4o, NLLB, and Google Translate, guided by the BERTScore and CometKiwi metrics, and categorize languages by resource class to analyze performance across natural-language resource levels. Evaluations of six diverse LLMs show that GPT-4o and Claude-3.5 provide the most robust cross-language code generation, while several models degrade notably on non-English prompts, especially in low-resource languages; the results also underline the importance of multilingual pretraining and data diversity. The work offers several subsets and an expert-translated variant to support rapid prototyping and future benchmarking, with practical implications for deploying multilingual code-generation systems and for future research on broader language coverage and transfer learning strategies.

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.
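The quality assurance step mentioned above can be pictured as choosing, for each prompt, the candidate translation with the best quality score. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` type, the `select_best` helper, the unweighted mean used to combine the two metrics, and all score values are hypothetical and chosen purely for illustration. The scores are assumed to be precomputed elsewhere.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Candidate:
    """One candidate translation of a prompt (hypothetical structure)."""
    system: str       # translation system that produced the candidate
    text: str         # translated prompt text
    bertscore: float  # precomputed BERTScore, assumed to lie in [0, 1]
    cometkiwi: float  # precomputed CometKiwi score, assumed to lie in [0, 1]


def combined_score(candidate: Candidate) -> float:
    # Assumption: the two metrics are combined with an unweighted mean.
    # The source does not state how the scores are combined.
    return mean([candidate.bertscore, candidate.cometkiwi])


def select_best(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    return max(candidates, key=combined_score)


# Placeholder candidates; texts and scores are made up, not reported results.
candidates = [
    Candidate("GPT-4o", "<translation A>", bertscore=0.90, cometkiwi=0.80),
    Candidate("NLLB", "<translation B>", bertscore=0.85, cometkiwi=0.70),
    Candidate("Google Translate", "<translation C>", bertscore=0.88, cometkiwi=0.75),
]

best = select_best(candidates)
print(best.system)
```

Keeping the combination rule in its own function makes it easy to swap in a different weighting or a per-metric threshold if the actual pipeline differs from this assumption.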
