OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models
Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, Dacheng Tao
TL;DR
The paper identifies a gap in code-generation benchmarks, which underrepresent object-oriented programming (OOP). It introduces a Python-based OOP benchmark of 431 tasks and a new metric, pass@o, designed specifically to evaluate the generation of OOP concepts, and uses both to evaluate 23 LLMs. Results show that even strong code-focused models struggle with OOP tasks and that pass@k can misrepresent true OOP capability, highlighting the need for targeted improvements and prompting strategies. The authors publicly release the benchmark and scripts to help the community improve LLMs' understanding and generation of OOP code, particularly for private and encapsulated constructs.
Abstract
Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) The poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
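To make the contrast between the two metrics concrete, here is a minimal sketch, not the authors' implementation. `pass_at_k` is the standard unbiased estimator used with HumanEval-style benchmarks. The abstract does not define pass@o, so the sketch assumes that a sample counts only if it passes the tests and also defines a set of required class and method names. The helpers `defined_names` and `has_required_constructs`, and the `Circle` / `__area` task, are hypothetical names introduced for illustration.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n = samples generated, c = samples counted as correct, k = budget.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def defined_names(source: str) -> set:
    """Collect the names of all classes and functions/methods in a program."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def has_required_constructs(source: str, required: set) -> bool:
    """Assumed OOP check: every required class/method name must be defined."""
    try:
        return required <= defined_names(source)
    except SyntaxError:
        return False


# Two hypothetical samples for a task that asks for a Circle class
# with a private __area method. Assume both pass the functional tests.
functional_sample = """
def circle_area(r):
    return 3.14159 * r * r
"""
oop_sample = """
class Circle:
    def __init__(self, r):
        self.r = r

    def __area(self):
        return 3.14159 * self.r ** 2

    def public_area(self):
        return self.__area()
"""
samples = [functional_sample, oop_sample]
required = {"Circle", "__area"}

n = len(samples)
c_tests = 2  # assumed: both samples pass the test cases
c_oop = sum(has_required_constructs(s, required) for s in samples)

print(pass_at_k(n, c_tests, 1))  # 1.0
print(pass_at_k(n, c_oop, 1))    # 0.5
```

Under these assumptions, the test-only score credits the purely functional solution, while the stricter count does not, which is the kind of gap the abstract attributes to pass@k on OOP tasks.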
