A Multi-Language Object-Oriented Programming Benchmark for Large Language Models
Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin
TL;DR
This paper targets fair evaluation of LLM-driven code generation across multiple languages by introducing MultiOOP, a multilingual object-oriented programming benchmark with 6 languages and 267 samples per language, built by translating a Python-based OOP benchmark and extending the pass@o metric. It also presents an automated framework for generating and validating diverse test cases to improve reliability and reduce accidental passes. Empirical results across 14 mainstream LLMs show substantial performance gaps between function-level benchmarks and MultiOOP, strong cross-language variability, and a persistent gap between pass@k and pass@o that reveals incomplete mastery of OOP concepts; few-shot prompting markedly boosts performance. The work contributes a robust, publicly released evaluation framework (code and data) to better assess multilingual OOP capabilities and guide future improvements in LLM-based code generation.
Abstract
Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language; 94.3% target only function-level or statement-level tasks; and over 80% include fewer than ten test cases on average. To address these gaps, we propose MultiOOP, a multi-language object-oriented programming benchmark covering six popular languages (Python, PHP, C++, C#, Java, JavaScript) with 267 tasks per language. We design a translator that extends an existing single-language OOP benchmark and the pass@o metric to a multilingual setting. We also propose an automated framework for augmenting test cases to ensure the reliability of the evaluation results. We evaluate 14 mainstream LLMs under zero-shot prompting and report three key findings: 1) Substantial performance degradation: pass@1 scores on MultiOOP drop by up to 65.6 percentage points compared to function-level tasks (e.g., HumanEval). 2) Cross-language variability: GPT-4o mini achieves a pass@1 of 48.06% in Python but only 0.12% to 15.26% in the other languages, indicating limited multilingual generalization. 3) Conceptual gaps: pass@o scores are consistently 1.1 to 19.2 points lower than pass@k, demonstrating that LLMs often generate executable code without fully capturing core OOP concepts. Our benchmark, metric extensions, and evaluation scripts will be publicly released to foster a more balanced and comprehensive assessment of LLMs in object-oriented code generation; the code will be available at https://github.com/alphadl/OOP-eval and the data at https://huggingface.co/datasets/codeai-dteam/MultiOOP.
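To make the pass@k versus pass@o comparison concrete, the following is a minimal sketch rather than the authors' evaluation script. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n generated samples of which c are correct. The treatment of pass@o is an assumption: the abstract does not define it, so the sketch counts a sample toward pass@o only if it passes the tests and also matches the expected OOP constructs. The names Sample, passed_tests, concepts_matched, and summarize are hypothetical.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Sample:
    """One generated solution (hypothetical record, not the benchmark's schema)."""
    passed_tests: bool       # all test cases executed successfully
    concepts_matched: bool   # assumed check: required OOP constructs are present


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(samples: list[Sample], k: int) -> tuple[float, float]:
    """Return (pass@k, pass@o) for one task under the assumption stated above."""
    n = len(samples)
    c_k = sum(s.passed_tests for s in samples)
    # Assumption: pass@o additionally requires the OOP-concept check to succeed.
    c_o = sum(s.passed_tests and s.concepts_matched for s in samples)
    return pass_at_k(n, c_k, k), pass_at_k(n, c_o, k)


if __name__ == "__main__":
    # Illustrative data only: 10 samples, 4 pass the tests, 2 also match concepts.
    demo = (
        [Sample(True, True)] * 2
        + [Sample(True, False)] * 2
        + [Sample(False, False)] * 6
    )
    print(summarize(demo, k=1))
```

Under this assumption the pass@o count can never exceed the pass@k count, so pass@o is at most pass@k for every task, which matches the direction of the gap reported in the abstract.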
