OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

Shuai Wang; Liang Ding; Li Shen; Yong Luo; Bo Du; Dacheng Tao

TL;DR

The paper identifies a gap in code-generation benchmarks, which underrepresent object-oriented programming (OOP). It introduces a Python-based OOP benchmark with 431 tasks and a novel pass@$o$ metric that specifically evaluates the generation of OOP concepts, and it evaluates 23 LLMs. Results show that even strong code-focused models struggle with OOP tasks and that pass@$k$ can misrepresent true OOP capability, highlighting the need for targeted improvements and prompting strategies. The authors publicly release the benchmark and scripts to drive community progress in improving LLMs’ OOP understanding and generation, particularly for private and encapsulated constructs.

Abstract

Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks, e.g., HumanEval and MBPP, largely neglect object-oriented programming (OOP) in favor of functional programming (FP). To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@$o$, tailored for OOP, enhancing traditional pass@$k$ measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@$o$ offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
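The relationship between pass@$k$ and pass@$o$ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released script: it uses the standard unbiased pass@$k$ estimator, and it assumes pass@$o$ applies the same estimator while counting a sample as correct only if it passes the tests and also defines the required OOP names. The function names pass_at_k and pass_at_o and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per task, c: samples that pass the tests,
    k: number of samples drawn.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every draw of k samples must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_o(n: int, c_oop: int, o: int) -> float:
    """ASSUMPTION: same estimator, but c_oop counts only samples that pass
    the tests AND define the required class and function names."""
    return pass_at_k(n, c_oop, o)


# Hypothetical task: 10 samples, 4 pass the tests, but only 1 of those
# also defines the required private function.
score_k = pass_at_k(n=10, c=4, k=1)
score_o = pass_at_o(n=10, c_oop=1, o=1)
```

Under these assumed counts, score_o is lower than score_k, which is the kind of gap the abstract attributes to pass@$k$ overlooking OOP structure.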

Paper Structure (36 sections, 3 equations, 15 figures, 7 tables)

This paper contains 36 sections, 3 equations, 15 figures, 7 tables.

Introduction
Related work
Code Evaluation Benchmark
Code Evaluation Metrics
Evaluation Framework
Overview
Building OOP Benchmarks
Data Filtering.
Human Rewritten.
Case Design.
Level Classification.
Evaluation Metrics Pass@$o$
Experiments
Experimental Setup
Evaluated LLMs
...and 21 more sections

Figures (15)

Figure 1: The performance comparison of widely-used code language models on functional programming (FP) and object-oriented programming (OOP) code generation benchmarks, in terms of pass@$1$ scores. We see that all models perform relatively well on FP benchmarks, i.e., HumanEval [Chen et al., 2021] and MBPP [Austin et al., 2021], while exhibiting poor performance on our OOP benchmark.
Figure 2: The generation of private functions cannot be evaluated using pass@$k$. (We instructed the ChatGPT model [Ouyang et al., 2022; OpenAI, 2023] to generate the class SS, the public function public_Shortest_subarray, and the private function __private_Shortest_subarray based on a given prompt, and to implement the corresponding requirements within the functions. However, ChatGPT does not generate the private function __private_Shortest_subarray outlined in the red box.)
Figure 3: The construction process of our object-oriented programming (OOP) benchmark.
Figure 4: A case comparison of generation results from Qwen-14b and WizardCoder-15b on the OOP benchmark. We see: 1) Qwen-14b accurately generates the private functions, while WizardCoder-15b does not; 2) the results generated by both models pass the evaluation using pass@$k$; 3) the results generated by Qwen-14b pass the evaluation using pass@$o$, but those generated by WizardCoder-15b do not.
Figure 5: Distribution of search results for ChatGPT and CodeLlama-34b. (In a program, "class" serves as the indicator for class names; if a program does not contain "class", the LLM made an error in generating the class name. Similarly, "def _" and "def __" serve as indicators for private function names, "def" as the indicator for public function names, and "def __init__" as the indicator for attribute variable names. In our OOP benchmark, the LLM should ideally generate at least 86,200 "class", 36,000 "def __" or "def _", 86,200 "def", and 70,800 "def __init__".)
...and 10 more figures
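The structural check implied by the captions for Figures 2, 4, and 5 can be sketched as follows. This is a hedged illustration rather than the benchmark's own script: the caption for Figure 5 describes a string search for indicators such as "class" and "def __", whereas this sketch swaps in Python's ast module to compare defined names against required ones. The helper names defined_names and matches_oop_spec are hypothetical, and the two sample programs are invented placeholders that reuse the names from Figure 2.

```python
import ast


def defined_names(source: str) -> tuple[set[str], set[str]]:
    """Return (class names, function names) defined anywhere in source.

    Unparsable source yields two empty sets.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set(), set()
    classes = {n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)}
    functions = {
        n.name
        for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return classes, functions


def matches_oop_spec(source: str, required_classes, required_functions) -> bool:
    """True only if every required class and function name is defined."""
    classes, functions = defined_names(source)
    return set(required_classes) <= classes and set(required_functions) <= functions


# Placeholder samples (bodies are invented, names follow Figure 2).
WITH_PRIVATE = '''
class SS:
    def __private_Shortest_subarray(self, nums):
        return len(nums)

    def public_Shortest_subarray(self, nums):
        return self.__private_Shortest_subarray(nums)
'''

WITHOUT_PRIVATE = '''
class SS:
    def public_Shortest_subarray(self, nums):
        return len(nums)
'''

REQUIRED_CLASSES = ["SS"]
REQUIRED_FUNCTIONS = ["public_Shortest_subarray", "__private_Shortest_subarray"]

has_private = matches_oop_spec(WITH_PRIVATE, REQUIRED_CLASSES, REQUIRED_FUNCTIONS)
lacks_private = matches_oop_spec(WITHOUT_PRIVATE, REQUIRED_CLASSES, REQUIRED_FUNCTIONS)
```

Both placeholder samples would return the same value from public_Shortest_subarray, so a purely test-based check could accept either, while the name check separates them. Parsing with ast avoids false matches on "class" or "def" inside strings and comments, at the cost of rejecting programs that do not parse.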
