A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin

TL;DR

This paper targets fair evaluation of LLM-driven code generation across multiple languages by introducing MultiOOP, a multilingual object-oriented programming benchmark with 6 languages and 267 samples per language, built by translating a Python-based OOP benchmark and extending the pass@o metric. It also presents an automated framework for generating and validating diverse test cases to improve reliability and reduce accidental passes. Empirical results across 14 mainstream LLMs show substantial performance gaps between function-level benchmarks and MultiOOP, strong cross-language variability, and a persistent gap between pass@k and pass@o that reveals incomplete mastery of OOP concepts; few-shot prompting markedly boosts performance. The work contributes a robust, publicly released evaluation framework (code and data) to better assess multilingual OOP capabilities and guide future improvements in LLM-based code generation.

Abstract

Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language; 94.3% target only function-level or statement-level tasks; and over 80% include fewer than ten test cases on average. To address these gaps, we propose MultiOOP, a multi-language object-oriented programming benchmark covering six popular languages (Python, PHP, C++, C#, Java, JavaScript) with 267 tasks per language. We design a translator that extends an existing single-language OOP benchmark and the pass@o metric to a multilingual setting. Moreover, we propose an automated framework for augmenting test cases to ensure the reliability of the evaluation results. We evaluate 14 mainstream LLMs under zero-shot prompting and report three key findings: 1) Substantial performance degradation: pass@1 scores on MultiOOP drop by up to 65.6 percentage points compared to function-level tasks (e.g., HumanEval). 2) Cross-language variability: GPT-4o mini achieves pass@1 of 48.06% in Python but only 0.12%-15.26% in other languages, indicating limited multilingual generalization. 3) Conceptual gaps: pass@o scores are consistently 1.1-19.2 points lower than pass@k, demonstrating that LLMs often generate executable code without fully capturing core OOP concepts. Our benchmark, metric extensions, and evaluation scripts will be publicly released to foster a more balanced and comprehensive assessment of LLMs in object-oriented code generation: code at https://github.com/alphadl/OOP-eval and data at https://huggingface.co/datasets/codeai-dteam/MultiOOP.
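
The gap between pass@k and pass@o reported in the abstract can be illustrated with a minimal sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass the unit tests. The pass@o variant shown here is an assumption, not the authors' confirmed definition: it reuses the same estimator but counts only the samples that both pass the tests and match the required OOP constructs. The names `pass_at_k`, `score_task`, and the `(passed, matched)` sample layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one task, c: samples counted as correct,
    k: number of attempts the metric allows.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def score_task(samples, k):
    """Return (pass@k, pass@o) for one task.

    samples: list of (passed, matched) booleans per generated sample.
    'passed' means the unit tests succeeded; 'matched' means a
    hypothetical matching function found the required OOP constructs.
    Assumption: pass@o counts only samples where both are true.
    """
    n = len(samples)
    c = sum(1 for passed, _ in samples if passed)
    o = sum(1 for passed, matched in samples if passed and matched)
    return pass_at_k(n, c, k), pass_at_k(n, o, k)


if __name__ == "__main__":
    # Illustrative data only: 10 samples, 4 pass the tests,
    # and 2 of those 4 also use the required OOP constructs.
    samples = [(True, True)] * 2 + [(True, False)] * 2 + [(False, False)] * 6
    p_k, p_o = score_task(samples, k=1)
    print(f"pass@1 = {p_k:.2f}, pass@o (k=1) = {p_o:.2f}")
```

Because the concept-matching samples are a subset of the test-passing samples, this formulation can never score pass@o above pass@k, which is consistent with the direction of the gap the abstract reports.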

Paper Structure

This paper contains 39 sections, 3 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: An example from the HumanEval benchmark [cassano2023multipl] and the MBPP benchmark [austin2021program]. The HumanEval benchmark task includes both code skeletons and requirement descriptions, while the MBPP benchmark only includes requirement descriptions.
  • Figure 2: An example of a single-language object-oriented programming benchmark. The JavaBench benchmark [cao2024javabench] task includes both code skeletons and requirement descriptions, while the OOP benchmark [wang-etal-2024-oop] only includes requirement descriptions.
  • Figure 3: The process of constructing the MultiOOP evaluation benchmark, using C++ as the reference target; the same process applies when translating Python to the other programming languages (Java, C#, PHP, and JavaScript). Construction is divided into three stages: translation of the requirement description, of the unit tests, and of the matching function.
  • Figure 4: An automated framework for generating test cases.
  • Figure 5: Error distribution of Python and C++ programming languages.
  • ...and 8 more figures